
Kaggle Competition Submission: Titanic: Machine Learning from Disaster¶

  • Author: Paul Tongyoo
  • Contact: Message me on LinkedIn
  • Date: May 20, 2025
  • Official Competition Page: Titanic: Machine Learning from Disaster
  • Latest Submission Score (Accuracy): 0.7727 (Top 39% of 15995 entries)

(Work in progress)

Table of Contents¶

  1. Project Summary
    1. What I Did
    2. What I Learned
    3. What's Next
  2. Introduction
  3. Methodology
    1. Data Understanding
      1. Data Dictionary
      2. Variable Notes
      3. Descriptive Statistics
      4. Row Samples
      5. Data Types
      6. Missing Values Summary
    2. Data Preparation
      1. Missing Value Imputation
        1. Embarked
        2. Cabin
        3. Age
        4. Fare
    3. Exploratory Data Analysis
      1. Target
      2. Individual Features x Target
        1. Pclass
        2. Sex
        3. SibSp
        4. Parch
        5. Embarked_
        6. HasCabin
        7. Cabin_count
        8. Cabin_Location_s
        9. Deck
        10. Title
        11. Age_
        12. Age_Group
        13. Fare_
        14. Summary of Single Feature Relationship with Target
      3. Composite Feature x Target
        1. Pclass x Sex
        2. Pclass x Title
        3. Pclass x Parch
        4. Pclass x SibSp
        5. Sex x Parch
        6. Sex x SibSp
        7. Pclass x Embarked
        8. Sex x Embarked
        9. Pclass x HasCabin
        10. Sex x HasCabin
        11. Parch x HasCabin
        12. SibSp x HasCabin
        13. Embarked x HasCabin
        14. Pclass x Cabin_count
        15. Sex x Cabin_count
        16. Pclass x Cabin_Location_s
        17. Sex x Cabin_Location_s
        18. Pclass x Deck_bin
        19. Sex x Deck_bin
        20. Parch x Deck_bin
        21. SibSp x Deck_bin
        22. Deck x Cabin_Location_s
        23. Pclass x Title_bin
        24. Sex x Title_bin
        25. Pclass x Age_Group
        26. Sex x Age_Group
        27. Pclass x FPP_log_bin
        28. Sex x FPP_log_bin
        29. Pclass x Parch_SibSp
        30. Sex x Parch_SibSp
        31. HasCabin x Parch_SibSp
      4. Hi-Cardinality Features
        1. Ticket
      5. Feature Priority Based on EDA
    4. Cross-Fold Distribution Shift Analysis
    5. Feature Engineering
      1. Reduce Distribution Shift of Select Features
        1. Pclass x Age_Group
        2. Pclass_HasCabin
        3. Sex x HasCabin
        4. Embarked x HasCabin
        5. Parch_SibSp_bin
        6. HasCabin x Parch_SibSp_bin
        7. Pclass x Parch_SibSp_bin
        8. Sex x Parch_SibSp_bin
        9. Pclass x Embarked
        10. Sex x Embarked
        11. Pclass x Deck_bin
        12. Pclass x Cabin_Location_s
        13. Pclass x Normalized Title
        14. Deck_bin
        15. Title_normalized
      2. Pclass_Sex One-Hot Encodings
      3. Survival Association Tests
        1. Global Feature Survival Association Tests
        2. Pclass x Sex Subgroup Feature Survival Association Tests
        3. Survival Association Test Strategy and Results
      4. Smoothed Survival Rate Feature Engineering
        1. Generate Global Smoothed Features
        2. Is_Shared_Ticket
    6. Model Development
      1. Baseline Establishment
        1. Predict Majority Class
        2. Predict Simple Model
      2. Engineered Features Test
    7. Hyperparameter Tuning
      1. Out-of-Fold Prediction Mistake Analysis
      2. SHAP Analysis of Mistakes
  4. Submission
  5. References

Project Summary¶

This project tackles the classic Kaggle challenge: predicting passenger survival on the Titanic using machine learning. It serves as a hands-on exercise in feature engineering, model development, and interpretability within a well-known dataset, allowing for deep exploration of structured data analysis and model evaluation techniques.


What I Did¶

  • Conducted extensive feature engineering, combining domain knowledge and statistical validation to create globally smoothed target-encoded features and subgroup-specific smoothed features.
  • Designed features around Pclass-Sex cohorts using conditional masking, binning, and normalization strategies (e.g., Pclass_Title_normalized, Age_Group, Deck_bin).
  • Evaluated features using Chi-Squared tests, Cramér’s V, and cross-fold KL divergence, prioritizing variables with consistent distributions and statistically significant survival associations.
  • Built a high-performing XGBoostClassifier pipeline, using ablation testing, SHAP plots, and hyperparameter tuning (e.g., max_depth, min_child_weight, gamma, reg_alpha) to balance accuracy and generalization. Achieved a mean average accuracy of 0.8114 across cross-validation of the Kaggle training data set.

What I Learned¶

  • High cross-validation accuracy on the training set doesn't always translate to strong performance on the unseen test set — my 0.8114 CV accuracy dropped to 0.7727 on Kaggle's hidden test set. This highlighted how easily models can overfit to patterns specific to the training distribution, especially when engineered features subtly leak group identity or encode rare patterns that don't generalize.
  • Smoothed rate encoding can outperform one-hot encoding when group sizes are sufficiently supported and carefully regularized to avoid leakage.
  • Subgroup relevance masking (e.g., using Pclass_Sex) is tricky to enforce in practice; even zeroing out or setting NaNs doesn't fully eliminate feature leakage in tree-based models.
  • KL divergence is a powerful diagnostic for assessing train–validation distribution shifts, especially for engineered composite features.
  • SHAP plots revealed how certain features (e.g., P3_Female_Embarked_smoothed, Deck_bin) mislead the model when data sparsity or masking wasn't properly handled.
  • Small gains in accuracy sometimes come at the cost of generalizability. Monitoring standard deviation in CV scores became as important as mean accuracy.
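
A minimal sketch of the KL-divergence diagnostic mentioned above, using toy Series rather than the notebook's actual engineered features:

```python
import pandas as pd
from scipy.stats import entropy

def categorical_kl(train_col, val_col, eps=1e-9):
    """KL divergence between the category distributions of two Series (0 = identical)."""
    cats = sorted(set(train_col) | set(val_col))
    # Align both distributions on the same category index; eps guards against log(0)
    p = train_col.value_counts(normalize=True).reindex(cats, fill_value=0) + eps
    q = val_col.value_counts(normalize=True).reindex(cats, fill_value=0) + eps
    return entropy(p, q)  # scipy re-normalizes and computes sum(p * log(p / q))

# Identical distributions -> divergence ~ 0; shifted distributions -> larger values
same = categorical_kl(pd.Series(["A", "B", "A", "B"]), pd.Series(["B", "A", "B", "A"]))
shifted = categorical_kl(pd.Series(["A", "A", "A", "B"]), pd.Series(["B", "B", "B", "A"]))
print(round(same, 6), round(shifted, 4))
```

In the notebook this idea is applied per cross-validation fold: engineered features whose train/validation category distributions diverge strongly are deprioritized.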

What's Next¶

  • Implement model ensembling to incorporate predictions from a complementary model (e.g. LogisticRegression) to improve generalization to unseen data.
  • Continue experimenting with features and model configurations to improve accuracy score.
  • Continue refining SHAP-driven debugging workflows to triage false positives/negatives and identify data segments where the model is overconfident or blind.

Introduction¶

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, Kaggle asked us to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Methodology¶

In [12]:
from datetime import datetime
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype, IntervalDtype
from scipy.stats import entropy
from scipy.special import rel_entr
import math
from scipy.stats import chi2_contingency
from IPython.display import display
from itertools import combinations
from collections import defaultdict
import IPython

import seaborn as sns
sns.set_theme()
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
custom_cmap = LinearSegmentedColormap.from_list("survival_cmap", ["tomato", "lightblue"])

import re
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler  
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.model_selection import learning_curve, validation_curve, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree, export_text
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import shap

Data Understanding¶

Kaggle provided passenger data split into two groups:

  • Training set (train.csv)
    • 891 rows, 12 features
  • Test set (test.csv)
    • 418 rows, 11 features (no Survived col)
In [15]:
# Kaggle.com env
#train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
#test_df = pd.read_csv('/kaggle/input/titanic/test.csv')

# Local Env
train_df = pd.read_csv('./input/train.csv')
test_df = pd.read_csv('./input/test.csv')

Data Dictionary¶

Variable Definition Key, Example
Survived Survival 0 = No, 1 = Yes
PassengerId Integer index of passenger 0,1,2,3,...
Name Name of passenger including title Braund, Mr. Owen Harris
Pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
Sex Sex female, male
Age Age in years 0.15, 2, 15
SibSp # of siblings / spouses aboard the Titanic 0,1,2,3,..
Parch # of parents / children aboard the Titanic 0,1,2,3,..
Ticket Ticket number, some with prefixes, shared among groups SW/PP 751
Fare Passenger fare, shared among groups 7.91, 14.4542, 512.3292
Cabin Cabin number(s) listed on ticket, prefixed with deck letter, shared among groups A20, "B57 B59 B63 B66"
Embarked Port of Embarkation C = Cherbourg (France), Q = Queenstown (Ireland), S = Southampton (England)

Variable Notes¶

pclass: A proxy for socio-economic status (SES)

  • 1st = Upper
  • 2nd = Middle
  • 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
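
As an aside (not part of the original pipeline), that convention could be used to flag estimated ages, since genuine fractional ages only occur for infants under 1:

```python
import pandas as pd

ages = pd.Series([22.0, 35.5, 0.42, 28.5, 80.0])

# Ages of at least 1 year ending in .5 follow the "estimated" xx.5 convention;
# fractional ages below 1 are real infant ages, not estimates.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)
print(is_estimated.tolist())  # [False, True, False, True, False]
```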

sibsp: The dataset defines family relations in this way...

  • Sibling = brother, sister, stepbrother, stepsister
  • Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...

  • Parent = mother, father
  • Child = daughter, son, stepdaughter, stepson
  • Some children travelled only with a nanny, therefore parch=0 for them.
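
Because SibSp and Parch jointly describe a passenger's travelling family, a combined family-size feature is a common derivation for this dataset; a minimal sketch (FamilySize and IsAlone are illustrative names here, distinct from the Parch_SibSp composites used later in this notebook):

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0, 0, 1], "Parch": [0, 0, 2, 5]})

# Family size = the passenger plus all siblings/spouses and parents/children aboard
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df["FamilySize"].tolist(), df["IsAlone"].tolist())  # [2, 1, 3, 7] [0, 1, 0, 0]
```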

Descriptive Statistics¶

In [18]:
train_df.describe(include='number')
Out[18]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Initial inferences:

  • PassengerId: column is most likely a sequential integer index from 1 to 891
  • Survived: Mean survival rate for this data set is 38.4%
  • Pclass: 25th percentile is 2nd class and median is 3rd class; roughly a quarter of passengers were in 1st class and over half (55%) in 3rd
  • Age: Youngest child was less than 6 months old, oldest was 80 years old. Median age 28 yrs, 75th percentile 38 yrs. 177 values missing
  • SibSp / Parch: At least 50% of passengers traveled alone, and 75% of passengers had at most 1 sibling or spouse. Max SibSp is 8
  • Fare: Fares ranged from 0.00 (potentially crew?) to 512.33. Median fare 14.45, 75th percentile 31.00 - highest fares likely correlate with upper class
In [20]:
train_df.describe(include='object')
Out[20]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Braund, Mr. Owen Harris male 347082 B96 B98 S
freq 1 577 7 4 644

Initial Inferences:

  • Name: Contains surname and title info, probable tokens for splitting
  • Sex: Majority of passengers were male
  • Ticket: Contains a number; some have prefixes with potentially useful info. 210 ticket values duplicate another passenger's (891 rows vs 681 unique tickets), indicating groups traveling together (may be family and/or household staff)
  • Cabin: Contains room number and deck letter prefix. Some cabin strings contain multiple cabin numbers, may also signal family or household staff relationships. 687 values missing
  • Embarked: Majority of passengers embarked from Southampton, England (644). 2 values missing
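
The shared-ticket count can be reproduced from the describe() output above: 891 rows minus 681 unique tickets leaves 210 duplicate rows. A self-contained sketch of the same logic on a toy Series:

```python
import pandas as pd

tickets = pd.Series(["347082", "347082", "347082", "PC 17599", "113803", "113803"])

# Rows whose ticket appears on at least one other row (all members of shared tickets)
shared_rows = int(tickets.duplicated(keep=False).sum())
# Duplicate rows beyond the first occurrence (equals count - nunique)
extra_rows = int(tickets.duplicated().sum())
print(shared_rows, extra_rows)  # 5 3
```

On the real train.csv, `train_df['Ticket'].duplicated().sum()` gives the 210 figure; `duplicated(keep=False).sum()` counts every passenger involved in sharing, which is larger.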

Row Samples¶

In [23]:
train_df.head()
Out[23]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Data Types¶

In [25]:
train_df.dtypes
Out[25]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Missing Values Summary¶

In [27]:
train_df.isnull().sum().loc[lambda x: x > 0]
Out[27]:
Age         177
Cabin       687
Embarked      2
dtype: int64
In [28]:
test_df.isnull().sum().loc[lambda x: x > 0]
Out[28]:
Age       86
Fare       1
Cabin    327
dtype: int64

Data Preparation¶

Missing Value Imputation¶

Embarked¶
  • Only 2 missing values in the data set, will impute based on available data
  • Both passengers list the same cabin and ticket number, strongly suggesting they traveled together and embarked from the same port
  • Historical records show Martha Evelyn Stone and (Rose) Amelie Icard embarked from Southampton (See 1,2 in References section)
    • Amelie was maid to Mrs Stone
  • This being said, to maximize the predictive integrity of the upcoming modeling, we'll rely only on information available at train time instead.
  • Passengers residing on Deck B embarked from either Southampton or Cherbourg, increasing the likelihood these two passengers embarked from one of those two locations.
  • Verified that neither Cabin B28 nor Ticket 113572 is shared by any other passengers.
  • Verified that no other passengers share the same last name as Icard or Stone.
  • Given 58.8% of 1st class passengers embarked from Southampton, it is more likely these passengers embarked from Southampton as well.
  • Missing Embarked values will be imputed with the most common embarkation point of passengers of the same Pclass.
In [33]:
train_df[train_df['Embarked'].isnull()][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
Out[33]:
Name Pclass Ticket Cabin Age Fare Parch SibSp
61 Icard, Miss. Amelie 1 113572 B28 38.0 80.0 0 0
829 Stone, Mrs. George Nelson (Martha Evelyn) 1 113572 B28 62.0 80.0 0 0
In [34]:
train_df[train_df['Ticket'] == '113572'][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
Out[34]:
Name Pclass Ticket Cabin Age Fare Parch SibSp
61 Icard, Miss. Amelie 1 113572 B28 38.0 80.0 0 0
829 Stone, Mrs. George Nelson (Martha Evelyn) 1 113572 B28 62.0 80.0 0 0
In [35]:
train_df[train_df['Cabin'] == 'B28'][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
Out[35]:
Name Pclass Ticket Cabin Age Fare Parch SibSp
61 Icard, Miss. Amelie 1 113572 B28 38.0 80.0 0 0
829 Stone, Mrs. George Nelson (Martha Evelyn) 1 113572 B28 62.0 80.0 0 0
In [36]:
train_df[train_df['Name'].str.contains("Stone")][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
Out[36]:
Name Pclass Ticket Cabin Age Fare Parch SibSp
319 Spedden, Mrs. Frederic Oakley (Margaretta Corn... 1 16966 E34 40.0 134.5 1 1
829 Stone, Mrs. George Nelson (Martha Evelyn) 1 113572 B28 62.0 80.0 0 0
In [37]:
train_df[train_df['Name'].str.contains("Icard")][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
Out[37]:
Name Pclass Ticket Cabin Age Fare Parch SibSp
61 Icard, Miss. Amelie 1 113572 B28 38.0 80.0 0 0
In [38]:
deck_df = train_df[['Cabin', 'Embarked']].copy()
deck_df['Deck'] = deck_df['Cabin'].str[0]
deck_df[deck_df['Deck'] == 'B']['Embarked'].value_counts()
Out[38]:
Embarked
S    23
C    22
Name: count, dtype: int64
In [39]:
first_class_df = train_df[train_df['Pclass'] == 1]
embarked_counts = first_class_df['Embarked'].value_counts(dropna=False).sort_index()
embarked_percent = (embarked_counts / embarked_counts.sum()) * 100
summary_df = pd.DataFrame({
    'Count': embarked_counts,
    'Percentage': embarked_percent.round(2)
})
summary_df.index.name = 'Embarked'
summary_df.reset_index(inplace=True)
print(summary_df)
  Embarked  Count  Percentage
0        C     85       39.35
1        Q      2        0.93
2        S    127       58.80
3      NaN      2        0.93
In [40]:
train_df.groupby(['Embarked', 'Pclass'])['Fare'].median()
Out[40]:
Embarked  Pclass
C         1         78.2667
          2         24.0000
          3          7.8958
Q         1         90.0000
          2         12.3500
          3          7.7500
S         1         52.0000
          2         13.5000
          3          8.0500
Name: Fare, dtype: float64
In [41]:
train_df['Embarked'].describe()  # Southampton was the most common embarkation point, used for fallback scenario
Out[41]:
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object
In [42]:
def impute_embarked(df):
    """
    Imputes missing "Embarked" values with the most common embarkation point among passengers of the same Pclass.
    Sets the missing value to "S" if no mode can be found (i.e., in the unlikely scenario that all Embarked values for the current Pclass are missing)

    Args:
        df (DataFrame): Data set to impute (either training or test data set)

    Returns:
        Updated DataFrame with imputed Embarked feature
    """
    def impute_embarked_with_mode(row, df):
        if pd.isna(row['Embarked']):
            mode_value = df[df['Pclass'] == row['Pclass']]['Embarked'].mode()
            return mode_value[0] if not mode_value.empty else 'S'  # Default to 'S' if mode is not found
        else:
            return row['Embarked']
    
    df['Embarked'] = df.apply(lambda row: impute_embarked_with_mode(row, df), axis=1)
    return df

prepared_train_df = impute_embarked(train_df)
prepared_test_df = impute_embarked(test_df)
Cabin¶
  • 1st class passengers are missing only 18.5% of cabin numbers vs 2nd and 3rd class passengers missing 91% and 97.5% of Cabin numbers respectively -- Suggests having a Cabin number is a socio-economic class indicator that should be captured.
  • The known cabin values also have additional signals that should be captured separately:
    • Some contain more than one cabin designation (e.g. "B57 B59 B63 B66"):
      • Number of cabins becomes an indirect family and wealth signal
        • 1st class passengers had the most multi-cabin designations
    • Each cabin token is prefixed with a single letter, most likely the ship deck where it is located
    • The cabin number will also be extracted as a potential signal, as the number corresponds to the cabin's location on the ship
In [45]:
deck_df = train_df[['Pclass', 'Cabin']].copy()
missing_pct = deck_df.groupby('Pclass')['Cabin'].apply(lambda x: x.isna().mean() * 100).reset_index()
missing_pct.columns = ['Pclass', 'Missing_Cabin_Percentage']
print(missing_pct)
   Pclass  Missing_Cabin_Percentage
0       1                 18.518519
1       2                 91.304348
2       3                 97.556008
In [46]:
def derive_features_from_cabin_then_drop(df):
    """
    Creates FOUR new features based on contents of "Cabin" and then DROPS the "Cabin" column:
      1) "HasCabin" (int): 1 if passenger had a Cabin value, else 0
      2) "Cabin_count" (category): Number of cabins cited in passenger's Cabin value.  
                              Set to 0 for passengers with no Cabin value.
      3) "Deck" (category): Single letter identifying the deck where the known Cabin was located (e.g. A, B, C).
                            Set to 'M' for passengers with no Cabin value
      4) "Cabin_Location_s" (category): String indicating whether the cabin is located on the "port" or "starboard"
                                        side of the ship based on cabin number(s). Set to "port_and_starboard" if the
                                        string contains cabin numbers residing on both sides; "no_cabin_info" or
                                        "no_cabin_number" when the side can't be determined

    Args:
        df (DataFrame): Data set to impute (either training or test data set)

    Returns:
        Nothing
    """
    df['HasCabin'] = (df['Cabin'].notnull()).astype(int)
    df['Cabin_count'] = df['Cabin'].apply(lambda x: 0 if pd.isna(x) else len(x.split()))
    df['Cabin_count'] = df['Cabin_count'].astype('category')

    # Extract first character from non-missing Cabin values; assign 'M' for missing Cabin values
    df['Deck'] = df['Cabin'].apply(lambda x: 'M' if pd.isna(x) else x[0])  
    deck_order = sorted(df["Deck"].dropna().unique())
    df["Deck"] = pd.Categorical(df["Deck"], categories=deck_order, ordered=True)

    # Implement Cabin_Location_s
    def determine_cabin_side(cabin_str):
        if pd.isna(cabin_str):
            return "no_cabin_info" 
    
        # Extract all numeric parts from the cabin string
        cabin_numbers = re.findall(r'\d+', cabin_str)
        if not cabin_numbers:
            return "no_cabin_number"
    
        cabin_nums = [int(num) for num in cabin_numbers]
    
        all_even = all(num % 2 == 0 for num in cabin_nums)
        all_odd = all(num % 2 != 0 for num in cabin_nums)
    
        if all_even:
            return "port"
        elif all_odd:
            return "starboard"
        else:
            return "port_and_starboard"
    
    # Apply function to create the new feature
    df['Cabin_Location_s'] = df['Cabin'].apply(determine_cabin_side).astype('category')
    
    df.drop(columns="Cabin", inplace=True)

# Create prepared training and test data frames to be used for EDA and modeling
derive_features_from_cabin_then_drop(prepared_train_df)
derive_features_from_cabin_then_drop(prepared_test_df)
In [47]:
print(prepared_train_df[['HasCabin']].value_counts().sort_index())
print()
print(prepared_train_df[['Pclass', 'Cabin_count']].value_counts().sort_index())
print()
print(prepared_train_df[['Deck']].value_counts().sort_index())
print()
print(prepared_train_df[['Cabin_Location_s']].value_counts().sort_index())
HasCabin
0           687
1           204
Name: count, dtype: int64

Pclass  Cabin_count
1       0               40
        1              156
        2               12
        3                6
        4                2
2       0              168
        1               16
3       0              479
        1                8
        2                4
Name: count, dtype: int64

Deck
A        15
B        47
C        59
D        33
E        32
F        13
G         4
M       687
T         1
Name: count, dtype: int64

Cabin_Location_s  
no_cabin_info         687
no_cabin_number         4
port                  108
port_and_starboard      2
starboard              90
Name: count, dtype: int64
Age¶
  • Dropping rows with missing Age is not viable given the large number of missing values relative to the total samples in the data set, plus the availability of surrounding data to inform a calculated imputation strategy.
  • Imputing the missing values with the global mean/median age (29.7 and 28, respectively) or a missing indicator (e.g. -1) is inferior to researching the correlation strength between Age and the following features and taking the median age of each correlated group.
  • The following correlation strengths were identified (direction ignored given label encoding order undefined):
    • Pclass (0.42): Moderate correlation
    • Title (0.32): Moderate correlation
  • Given the above, Age will be imputed with the median age found for each group of passengers split by Pclass and Title.
In [50]:
# Determine correlation strength between Age and adjacent features
le = LabelEncoder()
df_corr = train_df[['Age', 'Sex', 'Pclass', 'Name']].copy()
df_corr['Sex_encoded'] = le.fit_transform(df_corr['Sex'])
df_corr['Title'] = df_corr['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False) + '.'
df_corr['Title_encoded'] = le.fit_transform(df_corr['Title'])
corr = df_corr[['Age', 'Sex_encoded', 'Pclass', 'Title_encoded']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr.loc[['Age']], annot=True, fmt=".2f", cmap='coolwarm', center=0, square=True)
plt.title("Feature Correlation Heatmap")
plt.show()
No description has been provided for this image
In [51]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_corr, x="Title", y="Age", hue="Pclass")
plt.title("Median Age by Title differs per Pclass")
plt.show()
No description has been provided for this image
In [52]:
def impute_age(df):
    """
    Imputes missing values in the "Age" column of the specified DataFrame with the median age
    of the data grouped by "Pclass" and Title (extracted from "Name")

    Adds "Title" column to the DataFrame as well.

    Args:
        df (DataFrame): Data set to impute (either training or test data set)

    Returns:
        Nothing
    """

    # First attempt to impute by Pclass X Title
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False) + '.'
    pclass_title_age_median_map = df.groupby(['Pclass', 'Title'])['Age'].median()
    def impute_age_by_pclass_title(row):
        if pd.notna(row['Age']):
            return row['Age']
        return pclass_title_age_median_map.loc[row['Pclass'], row['Title']]
    df['Age'] = df.apply(impute_age_by_pclass_title, axis=1)

    # In the scenario where all ages for a given title are missing, impute with median age for Sex
    sex_age_median_map = df.groupby(['Sex'])['Age'].median()
    def impute_age_by_sex(row):
        if pd.notna(row['Age']):
            return row['Age']
        return sex_age_median_map.loc[row['Sex']]
    df['Age'] = df.apply(impute_age_by_sex, axis=1)

impute_age(prepared_train_df)
impute_age(prepared_test_df)
Fare¶
  • We'll set missing Fare values to the median Fare of passengers from the same class and embarkation point
In [55]:
sns.boxplot(data=train_df, x="Pclass", y="Fare", hue="Embarked")
plt.show()
No description has been provided for this image
In [56]:
def impute_fare(df):
    """
    Imputes missing values in the "Fare" column of the specified DataFrame with the median Fare
    of passengers with the same Pclass and Embarked 

    Args:
        df (DataFrame): Data set to impute (either training or test data set)

    Returns:
        Nothing
    """
    df['Fare'] = df['Fare'].fillna(df.groupby(['Pclass', 'Embarked'])['Fare'].transform('median'))

impute_fare(prepared_train_df)
impute_fare(prepared_test_df)
In [57]:
# Confirm no more missing values
print(f"Missing Training Data Values:\n{prepared_train_df.isnull().sum().loc[lambda x: x > 0]}")
print(f"\nMissing Test Data Values:\n{prepared_test_df.isnull().sum().loc[lambda x: x > 0]}")
Missing Training Data Values:
Series([], dtype: int64)

Missing Test Data Values:
Series([], dtype: int64)

Exploratory Data Analysis¶

In [59]:
def plot_vars(plot_df):

    # Identify discrete and continuous columns
    discrete_vars = [col for col in plot_df.columns 
                     if plot_df[col].dtype in ['int64', 'int32', 'object', 'category'] 
                     and plot_df[col].nunique() <= 20]
    
    continuous_vars = [col for col in plot_df.columns 
                       if (plot_df[col].dtype in ['float64', 'float32'] 
                           or (plot_df[col].dtype in ['int64', 'int32'] and plot_df[col].nunique() > 20))]
    
    # Combine for plotting
    all_vars = discrete_vars + continuous_vars

    n_cols = 3
    n_rows = int(np.ceil(len(all_vars) / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows))
    
    # Flatten axes for easy indexing
    axes = axes.flatten()
    
    for i, col in enumerate(all_vars):
        if col in discrete_vars:
            sns.histplot(plot_df[col], ax=axes[i], discrete=True, shrink=0.8)
        else:
            sns.histplot(plot_df[col], ax=axes[i], kde=True, bins=30)
        
        axes[i].set_title(col)
        axes[i].set_xlabel('')
        axes[i].set_ylabel('Count')
    
    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()
In [60]:
# Plot base dataset
plot_vars(prepared_train_df)
No description has been provided for this image

Note: Name and Ticket are not plotted above but will be analyzed below. PassengerId will be ignored given its uniform distribution/definition.

Target¶

  • Global survival rate: 38.4%
  • Target is imbalanced 61.6% vs 38.4%
  • Be sure to stratify target column during cross-validation
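
The stratification point above can be sketched with scikit-learn's StratifiedKFold; the toy arrays below mimic the ~62/38 imbalance:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)  # imbalanced binary target, like Survived

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves the full data's 60/40 class ratio
    print(fold, np.bincount(y[val_idx]))  # every fold: [3 2]
```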
In [64]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Survived")
plt.show()

global_survival_rate = prepared_train_df['Survived'].mean()
print(f"Global Survival Rate: {global_survival_rate:.4f}")
No description has been provided for this image
Global Survival Rate: 0.3838

Individual Features x Target¶

  • The relationship between each feature and the target class Survived is analyzed below.
  • The relationship between combinations of features and the target class is analyzed in the subsequent subsection titled "Composite Features x Target".
Pclass¶
In [68]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Pclass", hue="Survived")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Pclass")
    .agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
    .reset_index()
    .sort_values(by="Survival_Rate", ascending=False)
)
survival_df
No description has been provided for this image
Out[68]:
Pclass Survival_Rate Count
0 1 0.629630 216
1 2 0.472826 184
2 3 0.242363 491
  • 1st class had the highest survival rate (63%)
  • 2nd class had a middling survival rate (47%)
  • 3rd class had the lowest survival rate (24%)
  • Non-linear relationship, high-sample size per class, clear threshold split points for trees
Sex¶
In [71]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Sex", hue="Survived")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Sex")
    .agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
    .reset_index()
    .sort_values(by="Survival_Rate", ascending=False)
)
survival_df
No description has been provided for this image
Out[71]:
Sex Survival_Rate Count
0 female 0.742038 314
1 male 0.188908 577
  • Most females survived (74.2%), consistent with "women and children first" evacuation protocol
  • The vast majority of males perished (only 18.9% survived)
  • Non-linear relationship, high-sample size per class, clear threshold split points for trees
SibSp¶
In [74]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="SibSp", hue="Survived")
plt.title("All SibSp Values")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['SibSp'] > 1], x="SibSp", hue="Survived")
plt.title("SibSp > 1")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("SibSp")
    .agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
    .reset_index()
    .sort_values(by="SibSp", ascending=True)
)
survival_df
No description has been provided for this image
No description has been provided for this image
Out[74]:
SibSp Survival_Rate Count
0 0 0.345395 608
1 1 0.535885 209
2 2 0.464286 28
3 3 0.250000 16
4 4 0.166667 18
5 5 0.000000 5
6 8 0.000000 7
  • Passengers with no siblings or spouses had a lower survival rate (34.5%) than those with 1 or 2 siblings/spouses (53.6% and 46.4%, respectively)
  • Survival rate decreased as the number of siblings/spouses increased from 3 to 4 (25.0%, 16.7%)
  • No passengers with 5 or more siblings/spouses survived
  • Non-linear relationship: survival rate increases from 0->1 and then decreases from 1->8
  • Low sample sizes for SibSp >= 2
    • Passengers with SibSp=5 and SibSp=8 presumably came from the same families => potentially rare cases, risks overfitting
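The shared-family hypothesis for large SibSp values could be checked by comparing surnames, since the Titanic `Name` column follows a "Surname, Title. Given" format. A minimal sketch on toy rows (the real check would run on `prepared_train_df`):

```python
import pandas as pd

# Toy rows mimicking the Titanic 'Name' format: "Surname, Title. Given"
toy = pd.DataFrame({
    "Name": ["Goodwin, Master. William", "Goodwin, Miss. Lillian",
             "Sage, Mr. Frederick", "Sage, Miss. Stella"],
    "SibSp": [5, 5, 8, 8],
})

# Surname is everything before the first comma
toy["Surname"] = toy["Name"].str.split(",").str[0].str.strip()

# One unique surname per SibSp level supports the single-family hypothesis
families_per_level = toy.groupby("SibSp")["Surname"].nunique()
print(families_per_level)
```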
Parch¶
In [77]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Parch", hue="Survived")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['Parch'] > 2], x="Parch", hue="Survived")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Parch")
    .agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
    .reset_index()
    .sort_values(by="Parch", ascending=True)
)
survival_df
No description has been provided for this image
No description has been provided for this image
Out[77]:
Parch Survival_Rate Count
0 0 0.343658 678
1 1 0.550847 118
2 2 0.500000 80
3 3 0.600000 5
4 4 0.000000 4
5 5 0.200000 5
6 6 0.000000 1
  • Passengers with no parents or children had a lower survival rate (34.4%) than those with 1-3 parents/children
  • Survival rate increased from 55.1% to 60.0% across Parch 1 to 3, then drops sharply for Parch 4-6 (0%, 20%, 0%)
  • Non-linear relationship: survival rate increases from 0->1 and then decreases beyond 3
  • Low sample sizes for Parch >= 3
Embarked_¶
In [80]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Embarked", hue="Survived")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Embarked")
    .agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
    .reset_index()
    .sort_values(by="Survival_Rate", ascending=False)
)
survival_df
No description has been provided for this image
Out[80]:
Embarked Survival_Rate Count
0 C 0.553571 168
1 Q 0.389610 77
2 S 0.339009 646
  • Most passengers from Cherbourg survived (55.4%)
  • Most passengers embarked at Southampton, which had the lowest survival rate (33.9%)
  • Most passengers from Queenstown perished; second-lowest survival rate (39.0%)
  • Clear survival threshold point by splitting on Embarked == 'C'
  • High sample sizes for the C and S classes, moderate sample size for Q
HasCabin¶
In [83]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="HasCabin", hue="Survived")
plt.xticks([0, 1], ["0", "1"])
plt.show()

survival_df = (
    prepared_train_df
    .groupby("HasCabin")
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size'),
        Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
        Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
        Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
        )
    .reset_index()
    .sort_values(by="Survival_Rate", ascending=False)
)
survival_df
No description has been provided for this image
Out[83]:
HasCabin Survival_Rate Count Pclass_1_Count Pclass_2_Count Pclass_3_Count
1 1 0.666667 204 176 16 12
0 0 0.299854 687 40 168 479
  • Clear survival threshold split between those with a cabin number (66.7%) and those without one (30.0%)
  • Majority of passengers with a cabin number were 1st class, linking having a cabin number to higher socioeconomic status and a higher survival rate
  • Conversely, a missing cabin number is associated with lower class and a lower survival rate
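HasCabin was engineered earlier in the notebook; a minimal sketch of the presumed derivation (flagging a non-null `Cabin` value) on toy data:

```python
import pandas as pd

# Toy Cabin column; None stands in for the missing cabin entries
toy = pd.DataFrame({"Cabin": ["C85", None, "B28", None]})

# HasCabin = 1 when a cabin number is present, else 0
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
print(toy["HasCabin"].tolist())
```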
Cabin_count¶
In [86]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Cabin_count", hue="Survived")
plt.title("All Cabin_count Values")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['Cabin_count'] != 0], x="Cabin_count", hue="Survived")
plt.title("Cabin_count > 0")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Cabin_count", observed=True)
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size'),
        Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
        Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
        Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
        )
    .reset_index()
    .sort_values(by="Cabin_count", ascending=True)
)
survival_df
No description has been provided for this image
No description has been provided for this image
Out[86]:
Cabin_count Survival_Rate Count Pclass_1_Count Pclass_2_Count Pclass_3_Count
0 0 0.299854 687 40 168 479
1 1 0.677778 180 156 16 8
2 2 0.562500 16 12 0 4
3 3 0.500000 6 6 0 0
4 4 1.000000 2 2 0 0
  • Passengers listing more than one cabin on their ticket had higher survival rates than those listing only one
  • Those listing more than one cabin were mainly 1st class passengers
  • Signal behavior is similar to Parch/SibSp (i.e., a "family size" indicator)
  • Low sample size for Cabin_count >= 2
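Cabin_count was also engineered upstream; presumably it counts the space-separated cabin entries in the `Cabin` string, which a sketch on toy data would do like this:

```python
import pandas as pd

# Toy Cabin column: multiple cabins are space-separated in the raw data
toy = pd.DataFrame({"Cabin": ["C23 C25 C27", "E46", None]})

# Count cabin entries; missing Cabin values become 0
toy["Cabin_count"] = toy["Cabin"].str.split().str.len().fillna(0).astype(int)
print(toy["Cabin_count"].tolist())
```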
Cabin_Location_s¶
In [89]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Cabin_Location_s", hue="Survived")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Cabin_Location_s", observed=True)
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size'),
         Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
         Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
         Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
        )
    .reset_index()
    .sort_values(by="Count", ascending=False)
)
survival_df
No description has been provided for this image
Out[89]:
Cabin_Location_s Survival_Rate Count Pclass_1_Count Pclass_2_Count Pclass_3_Count
0 no_cabin_info 0.299854 687 40 168 479
2 port 0.611111 108 96 6 6
4 starboard 0.733333 90 77 7 6
1 no_cabin_number 0.500000 4 1 3 0
3 port_and_starboard 1.000000 2 2 0 0
  • Starboard-side cabins had a higher survival rate (73.3%), consistent with historical accounts of Officer Murdoch following the protocol of "women and children first, then men if space remained"
  • Port-side cabins had a lower survival rate (61.1%), consistent with historical accounts of Officer Lightoller allowing only women and children (even at the expense of leaving boat seats empty!) and declining most men
  • Missing cabin info had a significantly lower survival rate (30.0%) and was mostly comprised of 2nd and 3rd class passengers, suggesting missing cabin info correlates with passenger status
  • High sample sizes for no_cabin_info, port, and starboard; very low sample size for the port_and_starboard class
Deck¶
In [92]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Deck", hue="Survived")
plt.title("All Deck Values")
plt.show()
sns.countplot(data=prepared_train_df[prepared_train_df['Deck'] != 'M'], x="Deck", hue="Survived")
plt.title("Deck != M")
plt.show()

survival_df = (
    prepared_train_df
    .groupby("Deck", observed=True)
    .agg(
        Survival_Rate=('Survived', 'mean'),
        Count=('Survived', 'size'),
        Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
        Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
        Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
    )
    .reset_index()
    .sort_values(by="Deck", ascending=True)
)
survival_df
No description has been provided for this image
No description has been provided for this image
Out[92]:
Deck Survival_Rate Count Pclass_1_Count Pclass_2_Count Pclass_3_Count
0 A 0.466667 15 15 0 0
1 B 0.744681 47 47 0 0
2 C 0.593220 59 59 0 0
3 D 0.757576 33 29 4 0
4 E 0.750000 32 25 4 3
5 F 0.615385 13 0 8 5
6 G 0.500000 4 0 0 4
7 M 0.299854 687 40 168 479
8 T 0.000000 1 1 0 0
  • Survival rates varied by deck, with some decks clustering at similar rates
  • Decks A, B, C were exclusive to 1st class
  • Decks B, D, E have similarly high survival rates (~75%)
  • Decks F-G held only 2nd and 3rd class passengers, each with low sample sizes
Title¶
In [95]:
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df.query("Title in ['Mr.', 'Mrs.', 'Miss.', 'Master.']"), x="Title", hue="Survived")
plt.title("Titles in (Mr., Mrs., Miss., Master.)")
plt.show()
plt.figure(figsize=(12,3))
sns.countplot(data=prepared_train_df.query("Title not in ['Mr.', 'Mrs.', 'Miss.', 'Master.']"), x="Title", hue="Survived")
plt.title("All other Titles")
plt.show()

survival_by_title_df = (
    prepared_train_df
    .groupby("Title")
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size'),
         Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
         Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
         Pclass_3_Count=('Pclass', lambda x: (x == 3).sum()))
    .reset_index()
    .sort_values(by="Count", ascending=False)
)
print(survival_by_title_df)

# Confirmation of age range for Master. title
print("\nConfirmation that 'Master.' Title ages reflect 'young boy': 0.42 - 12")
print(prepared_train_df.query("Title in ['Master.']")['Age'].describe())

# Confirmation Ms. title should be grouped with Miss. (unmarried) (assuming she would have traveled with a spouse)
print("\nConfirmation Ms. title should be grouped with Miss. (unmarried)")
print(prepared_train_df.query("Title in ['Ms.']")[['Name', 'SibSp']])
No description has been provided for this image
No description has been provided for this image
        Title  Survival_Rate  Count  Pclass_1_Count  Pclass_2_Count  \
12        Mr.       0.156673    517             107              91   
9       Miss.       0.697802    182              46              34   
13       Mrs.       0.792000    125              42              41   
8     Master.       0.575000     40               3               9   
4         Dr.       0.428571      7               5               2   
15       Rev.       0.000000      6               0               6   
7      Major.       0.500000      2               2               0   
1        Col.       0.500000      2               2               0   
10      Mlle.       1.000000      2               2               0   
11       Mme.       1.000000      1               1               0   
14        Ms.       1.000000      1               0               1   
0       Capt.       0.000000      1               1               0   
6       Lady.       1.000000      1               1               0   
5   Jonkheer.       0.000000      1               1               0   
3        Don.       0.000000      1               1               0   
2   Countess.       1.000000      1               1               0   
16       Sir.       1.000000      1               1               0   

    Pclass_3_Count  
12             319  
9              102  
13              42  
8               28  
4                0  
15               0  
7                0  
1                0  
10               0  
11               0  
14               0  
0                0  
6                0  
5                0  
3                0  
2                0  
16               0  

Confirmation that 'Master.' Title ages reflect 'young boy': 0.42 - 12
count    40.000000
mean      4.516750
std       3.433651
min       0.420000
25%       1.750000
50%       4.000000
75%       7.250000
max      12.000000
Name: Age, dtype: float64

Confirmation Ms. title should be grouped with Miss. (unmarried)
                          Name  SibSp
443  Reynaldo, Ms. Encarnacion      0
  • Top 3 Titles (Mr., Miss., Mrs.) reasonably spread across passenger classes, with high sample sizes
  • "Master" title relatively low sample size, but doesn't warrant binning with surrounding Titles given its implication of age
Age_¶
In [98]:
sns.displot(data=prepared_train_df, x="Age", hue="Survived", kde=True, bins=range(0, 85, 5))
plt.show()

prepared_train_df['Age_bin'] = pd.cut(prepared_train_df['Age'], bins=range(0, 85, 5))
survival_by_Age_bin_df = (
    prepared_train_df
    .groupby("Age_bin", observed=True)
    .agg(
        Survival_Rate=('Survived', 'mean'),
        Count=('Survived', 'size')
    )
    .reset_index()
    .sort_values(by="Age_bin")
)
print(survival_by_Age_bin_df)
No description has been provided for this image
     Age_bin  Survival_Rate  Count
0     (0, 5]       0.687500     48
1    (5, 10]       0.350000     20
2   (10, 15]       0.578947     19
3   (15, 20]       0.403101    129
4   (20, 25]       0.354839    124
5   (25, 30]       0.251256    199
6   (30, 35]       0.462264    106
7   (35, 40]       0.379310     87
8   (40, 45]       0.454545     55
9   (45, 50]       0.400000     40
10  (50, 55]       0.416667     24
11  (55, 60]       0.388889     18
12  (60, 65]       0.285714     14
13  (65, 70]       0.000000      3
14  (70, 75]       0.000000      4
15  (75, 80]       1.000000      1
  • First age bin (0,5] has the highest survival rate (68.8%), consistent with the "women and children first" evacuation policy
  • Survival rate decreases across bins (15,20] through (25,30], dropping to the lowest bin rate of 25.1%
  • Low sample sizes for ages 40+; all have similar survival rates, with the exception of the single (75,80] passenger
Age_Group¶
  • Created a domain-specific binned Age feature to assess the survival relationship and capture the "Young Child" difference in survival rate found above
  • Young Child survival rate is the greatest of all groups, consistent with the "women and children first" evac protocol
  • Survival rate differences amongst the remaining groups do not appear significant; will investigate when exploring Composite Features x Target
In [102]:
def create_feature_Age_Group(train_df, test_df):
    """
    Add an ordered categorical 'AgeGroup' feature to train and test DataFrames using revised domain-specific bins
    designed to reduce KL divergence to ≤ 0.02.

    Age bins:
        - Young_Child: 0-5
        - Child:       6–17
        - Young_Adult: 18–29
        - Adult:       30–59
        - Senior:      60+

    Args:
        train_df (pd.DataFrame): Training dataset containing an 'Age' column.
        test_df (pd.DataFrame): Test dataset containing an 'Age' column.

    Modifies:
        Adds an 'AgeGroup' column with ordered categorical values to both datasets.
    """
    bins = [0, 5, 17, 29, 59, np.inf]
    labels = ['Young_Child', 'Child', 'Young_Adult', 'Adult', 'Senior']
    age_cat_type = pd.CategoricalDtype(categories=labels, ordered=True)

    for df in [train_df, test_df]:
        df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
        df['Age_Group'] = df['Age_Group'].astype(age_cat_type)


create_feature_Age_Group(prepared_train_df, prepared_test_df)
In [103]:
sns.displot(data=prepared_train_df, x="Age_Group", hue="Survived", kde=True, bins=range(0, 85, 5))
plt.show()

survival_by_Age_Group_bin_df = (
    prepared_train_df
    .groupby("Age_Group", observed=True)
    .agg(
        Survival_Rate=('Survived', 'mean'),
        Count=('Survived', 'size')
    )
    .reset_index()
    .sort_values(by="Age_Group")
)
print(survival_by_Age_Group_bin_df)
No description has been provided for this image
     Age_Group  Survival_Rate  Count
0  Young_Child       0.687500     48
1        Child       0.434783     69
2  Young_Adult       0.310606    396
3        Adult       0.423295    352
4       Senior       0.269231     26
Fare_¶
In [105]:
print("Key Detail: Fare is shared amongst passengers sharing same ticket number!")
print("Sample subset of duplicate Ticket/Fare combinations in the data:")
fare_duplicates = prepared_train_df.groupby("Ticket")[['Ticket', "Fare"]].value_counts().head(10).reset_index()
fare_duplicates.columns = ["Ticket Number", "Fare", "Number of Duplicates"]
print(fare_duplicates)

sns.displot(data=prepared_train_df, x="Fare", hue="Survived", kde=True)
plt.title("Fare (raw) Distribution")
plt.show()
prepared_train_df["Fare_log"] = np.log1p(prepared_train_df["Fare"])
prepared_test_df["Fare_log"] = np.log1p(prepared_test_df["Fare"])
sns.displot(data=prepared_train_df, x="Fare_log", hue="Survived", kde=True)
plt.title("Fare (log) Distribution")
plt.show()

prepared_train_df['Fare_log_bin'] = pd.cut(prepared_train_df['Fare_log'], bins=np.arange(0, 8, 0.5))
prepared_test_df['Fare_log_bin'] = pd.cut(prepared_test_df['Fare_log'], bins=np.arange(0, 8, 0.5))
survival_by_Fare_log_bin_df = (
    prepared_train_df
    .groupby("Fare_log_bin", observed=True)
    .agg(
        Survival_Rate=('Survived', 'mean'),
        Count=('Survived', 'size')
    )
    .reset_index()
    .sort_values(by="Fare_log_bin")
)
print(survival_by_Fare_log_bin_df)
Key Detail: Fare is shared amongst passengers sharing same ticket number!
Sample subset of duplicate Ticket/Fare combinations in the data:
  Ticket Number     Fare  Number of Duplicates
0        110152  86.5000                     3
1        110413  79.6500                     3
2        110465  52.0000                     2
3        110564  26.5500                     1
4        110813  75.2500                     1
5        111240  33.5000                     1
6        111320  38.5000                     1
7        111361  57.9792                     2
8        111369  30.0000                     1
9        111426  26.5500                     1
No description has been provided for this image
No description has been provided for this image
  Fare_log_bin  Survival_Rate  Count
0   (1.5, 2.0]       0.000000      3
1   (2.0, 2.5]       0.223496    349
2   (2.5, 3.0]       0.414286    140
3   (3.0, 3.5]       0.456647    173
4   (3.5, 4.0]       0.400000     70
5   (4.0, 4.5]       0.641026     78
6   (4.5, 5.0]       0.823529     34
7   (5.0, 5.5]       0.666667     18
8   (5.5, 6.0]       0.625000      8
9   (6.0, 6.5]       1.000000      3
  • Important: Fare value reflects amount paid for all passengers sharing the same ticket (see "Ticket" EDA for more info on sharing). Here, fare refers to "aggregated fare".
  • Aggregated Fare is skewed heavily to the right; a log transformation was performed to create and plot the new Fare_log feature for analysis
  • Low sample sizes observed for (1.5, 2.0] bin and bins >= (4.5, 5.0]
  • Fares also overlap across multiple classes
  • Feature Engineering Plan:
    • First, we should calculate a "Fare per person" to account for scenarios where listed fare is total paid amongst passengers sharing same ticket number.
    • Bin the low sample size "Fare per person" ranges to increase generalizability
    • Combine with Pclass if needed to capture more context around Fare bins and prevent generalization errors around Fare-only patterns
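The "Fare per person" step in the plan above can be sketched with a groupby-transform on the shared ticket number (toy rows here; the column names match the notebook, `Fare_per_person` is a hypothetical name):

```python
import pandas as pd

# Toy rows: three passengers share ticket 110152, so Fare is the group total
toy = pd.DataFrame({
    "Ticket": ["110152", "110152", "110152", "110564"],
    "Fare":   [86.50, 86.50, 86.50, 26.55],
})

# Number of passengers sharing each ticket, aligned back to the rows
ticket_size = toy.groupby("Ticket")["Ticket"].transform("size")
toy["Fare_per_person"] = toy["Fare"] / ticket_size
print(toy["Fare_per_person"].round(2).tolist())
```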
Summary of Single Feature Relationship with Target¶
Feature Summary of Relationship with Target Domain Implication / Further EDA
Pclass P1: Most likely to survive, 63%; P2: Near equal, 47%, P3: Less likely to survive, 24% Priority given to higher class, more affluent
Sex male: Less likely, 19%, female: More likely, 74% Priority given to women, consistent with "women and children first" evac
SibSp 0: Less likely, 34.5%; 1: More likely, 53.6%; 2: Near equal, 46.4%; 3+: drops from 25% to 0% Larger families potentially more difficult to evacuate from cabins, or perhaps larger families tended to be in lower ticket classes less likely to survive
Parch Similar to SibSp for 0->2 and 5->6; 3: higher survival rate (though small sample size; could be all one family => rare case) See SibSp notes
Embarked C: Marginally higher survival rate, 55.4%; Q: 39.0%; S: 33.9% C may have the most high-class ticket holders; Q/S predominantly lower class (check in next section)
HasCabin Having a cabin number: 67%; not having one: 30% Having a cabin number may be associated with higher class; check in next section
Cabin_count Those with more than one cabin listed had survival rates of 50% and higher May represent elite families and may be all one family
Cabin_Location_s Those with cabin numbers listed on starboard side had 12% higher survival Consistent with historical reports of starboard evac procedure being more lenient than port-side evac
Deck Of those with cabin numbers, Decks B, D, and E had highest survival rates (~75%) Decks may have had closer proximity to lifeboats and/or disproportionate number of female and/or P1 ticket holders
Title "Mr." title had lowest survival, 15.7%; Miss./Mrs. had highest survival, 69.8%/79.2% respectively; Master. had 57.5%, marginally higher survival Title implies both Sex and Age; high female-title rates align with "women and children first", though the marginally lower rate for boys does not fully align (perhaps most children were lower class?)
Age Only ages (0,5] had high survival, 68.8%; (10,15] had second-highest survival, 57.9%; all other bins had survival between 28.6% and 46.2% (with the exception of the sole surviving (75,80] passenger) The high survival of (0,5] aligns with the "women and children first" evac protocol; the similar rates amongst the remaining bins suggest age alone was not a strong factor in survival; created Age_Group to explore significance of the survival relationship in the Composite Features x Target section
Fare/Fare_log Log1p(Fare) values higher than 4.0 had the highest survival rates, ranging from 64% to 100% Higher fares are an indication of socioeconomic class

Composite Feature x Target¶

In [110]:
def plot_survival_heatmap(df, feature_1, feature_2, target_col='Survived', cmap='Blues'):
    """
    Plots a heatmap showing survival rate and sample size per combination of two categorical features.

    Parameters:
        df (pd.DataFrame): The dataset containing the features and target.
        feature_1 (str): Feature for the y-axis.
        feature_2 (str): Feature for the x-axis.
        target_col (str): Binary target column, e.g. 'Survived'.
        cmap (str): Seaborn colormap for the heatmap.
    """
    # Compute survival rates and counts
    grouped = df.groupby([feature_1, feature_2], observed=True)[target_col]
    survival_rate = grouped.mean().unstack()
    sample_size = grouped.count().unstack()

    # Create string annotations like "0.73\n(n=88)"
    annotations = survival_rate.copy().astype("object")  # Prevent dtype warning
    for i in survival_rate.index:
        for j in survival_rate.columns:
            rate = survival_rate.loc[i, j]
            count = sample_size.loc[i, j]
            if pd.notna(rate) and pd.notna(count):
                annotations.loc[i, j] = f"{rate:.2f}\n(n={int(count)})"
            else:
                annotations.loc[i, j] = ""

    # Plot heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(survival_rate, annot=annotations, fmt="", cmap=cmap, cbar=True, linewidths=0.5, linecolor='gray')
    plt.title(f"Survival Rate by {feature_1} × {feature_2}")
    plt.xlabel(feature_2)
    plt.ylabel(feature_1)
    plt.tight_layout()
    plt.show()
Pclass x Sex¶
  • Visually distinct differences in survival rates of different combinations of gender and ticket class
  • Consistent with priority given to social class and "women (and children) first" evac policy
  • Next Step: Chi-squared test survival association with Pclass x Sex; Explore crossing additional features given high sample sizes.
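The chi-squared association tests planned throughout this section can be sketched with `scipy.stats.chi2_contingency` on a crosstab; toy data below, whereas the notebook would pass the real Pclass_Sex x Survived table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data with a strong class/sex vs. survival association
toy = pd.DataFrame({
    "Pclass_Sex": ["1_female"] * 10 + ["3_male"] * 10,
    "Survived":   [1] * 9 + [0] + [0] * 9 + [1],
})

# Contingency table of group x outcome counts
table = pd.crosstab(toy["Pclass_Sex"], toy["Survived"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

A small p-value here would support rejecting independence between the composite feature and survival.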
In [113]:
def create_feature_Pclass_Sex(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Sex' by combining:
    - Pclass (1, 2, 3)
    - Sex ('male' or 'female')

    Example values: '1_male', '3_female'

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Sex' column to both dataframes)
    """
    def combine(pclass, sex):
        return f"{pclass}_{sex}"

    for df in [train_df, test_df]:
        df['Pclass_Sex'] = df.apply(
            lambda row: combine(row['Pclass'], row['Sex']), axis=1
        ).astype(str)

    print("Created 'Pclass_Sex' in train_df and test_df.")

create_feature_Pclass_Sex(prepared_train_df, prepared_test_df)
Created 'Pclass_Sex' in train_df and test_df.
In [114]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Sex', cmap=custom_cmap)
No description has been provided for this image
Pclass x Title¶
  • P1 and P2 Young boys (Master.) had 100% survival rate (beware of n=12 sample size)
    • This interaction shows significantly heightened survival to young males compared to crossing Pclass x Sex
  • Survival rates amongst female titles (Mrs., Miss.) looks consistent to Pclass x Sex signal
  • Next Step: Chi-squared test association with Survival, investigate creating Master-specific breakout.
In [117]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Title', cmap=custom_cmap)
No description has been provided for this image
Pclass x Parch¶
  • Survival rate increased for P1 and P2 passengers as Parch increased (ignoring groups with n < 10)
  • Survival rate of P3 passengers increased from 0->1 and then decreased from 1->2
  • Next Step: Chi-squared test association between Pclass and Parch + SibSp.
In [120]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Parch', cmap=custom_cmap)
No description has been provided for this image
Pclass x SibSp¶
  • Survival rate increased for P1 passengers as SibSp increased (ignoring groups with n < 10)
  • Survival rate of P3 passengers increased from 0->1 and then decreased from 1->2
  • Next Step: Chi-squared test association between Pclass and Parch + SibSp.
In [123]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'SibSp', cmap=custom_cmap)
No description has been provided for this image
Sex x Parch¶
  • Males with at least one parent/child had near double survival rate of males traveling alone.
  • Next Step: Chi-squared test association between Sex and Parch; consider is_Male_Parch_0 feature.
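The candidate flag named in the bullet above could be sketched as a simple boolean interaction (toy data; `is_Male_Parch_0` is the hypothetical feature name from the note):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":   ["male", "male", "female"],
    "Parch": [0, 2, 0],
})

# 1 only for males traveling without parents/children
toy["is_Male_Parch_0"] = ((toy["Sex"] == "male") & (toy["Parch"] == 0)).astype(int)
print(toy["is_Male_Parch_0"].tolist())
```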
In [126]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Parch', cmap=custom_cmap)
No description has been provided for this image
Sex x SibSp¶
  • Similar observation with Parch
  • Males with at least one sibling or spouse had near double survival rate of males traveling alone.
  • Next Step: Chi-squared test survival association with Sex and SibSp; consider is_Male_SibSp0 feature.
In [129]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'SibSp', cmap=custom_cmap)
No description has been provided for this image
Pclass x Embarked¶
  • 3rd class passengers from Southampton had distinctly lowest survival rate (p = 0.19)
  • 3rd class survival of Cherbourg and Queenstown equal and lower than 1st and 2nd classes (p = 0.38)
  • Very few 1st and 2nd class passengers from Queenstown (n < 10)
  • Cherbourg survival of 1st and 2nd class greater than Southampton
  • Next Step: Chi-squared test survival association with Pclass x Embarked.
In [132]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Embarked', cmap=custom_cmap)
No description has been provided for this image
Sex x Embarked¶
  • Males from Queenstown had a distinctly low survival rate (p = 0.07)
  • Well supported across all groups (n > 30)
  • Next Step: Chi-squared test survival association with Sex x Embarked.
In [135]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Embarked', cmap=custom_cmap)
No description has been provided for this image
Pclass x HasCabin¶
  • 3rd class passengers without a cabin designation had a distinctly low survival rate (p = 0.24)
  • 2nd class passengers with a cabin designation had a distinctly high survival rate (p = 0.81), mid sized sample
  • Next Step: Chi-squared test survival association with Pclass x HasCabin.
In [138]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'HasCabin', cmap=custom_cmap)
No description has been provided for this image
Sex x HasCabin¶
  • Females with a cabin designation had distinctly high survival rate (p = 0.94)
  • Males without a cabin designation had a distinctly low survival rate (p = 0.14)
  • Next Step: Chi-squared test survival association with Sex x HasCabin.
In [141]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'HasCabin', cmap=custom_cmap)
No description has been provided for this image
Parch x HasCabin¶
  • Those having between 0 and 2 parents/children and a cabin designation have distinctly higher survival rates.
  • Next Step: Chi-squared test survival association with Parch x HasCabin.
In [144]:
plot_survival_heatmap(prepared_train_df, 'Parch', 'HasCabin', cmap=custom_cmap)
No description has been provided for this image
SibSp x HasCabin¶
  • Similar story with Parch
  • Those having between 0 and 2 siblings/spouses and a cabin designation have distinctly higher survival rates.
  • Next Step: Chi-squared test survival association with SibSp x HasCabin. Investigate (Parch + SibSp) x HasCabin.
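The (Parch + SibSp) x HasCabin idea above could start from a combined family-size count; a minimal sketch on toy data (`Family_Size` is a hypothetical column name):

```python
import pandas as pd

toy = pd.DataFrame({
    "SibSp":    [1, 0, 3],
    "Parch":    [2, 0, 1],
    "HasCabin": [1, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger themselves
toy["Family_Size"] = toy["SibSp"] + toy["Parch"] + 1
print(pd.crosstab(toy["Family_Size"], toy["HasCabin"]))
```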
In [147]:
plot_survival_heatmap(prepared_train_df, 'SibSp', 'HasCabin', cmap=custom_cmap)
No description has been provided for this image
Embarked x HasCabin¶
  • Passengers from Queenstown with a cabin designation have a distinctly higher survival (p = 0.75).
  • Passengers from Southampton without a cabin designation have a distinctly lower survival (p = 0.27).
  • Next Step: Chi-squared test survival association with Embarked x HasCabin.
In [150]:
plot_survival_heatmap(prepared_train_df, 'Embarked', 'HasCabin', cmap=custom_cmap)
No description has been provided for this image
Pclass x Cabin_count¶
  • Looks redundant to Pclass x HasCabin
  • Low sample sizes for Cabin_count >= 2 (n < 10)
  • Next Step: Skip engineering for now.
In [153]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Cabin_count', cmap=custom_cmap)
No description has been provided for this image
Sex x Cabin_count¶
  • Looks redundant to Sex x HasCabin
  • Low sample sizes for Cabin_count >= 2 (n < 10)
  • Next Step: Skip engineering for now.
In [156]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Cabin_count', cmap=custom_cmap)
No description has been provided for this image
Pclass x Cabin_Location_s¶
  • P1 starboard passengers had higher survival than P1 port passengers (p = 0.74 vs p = 0.60).
  • Sample sizes low for all scenarios
  • No_cabin_info scenario redundant to Pclass x HasCabin
  • Next Step: Chi-squared test survival association with Pclass x Cabin_Location_s
In [159]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Cabin_Location_s', cmap=custom_cmap)
No description has been provided for this image
Sex x Cabin_Location_s¶
  • Survival rate does not differ significantly across cabin locations.
  • Next Step: Skip engineering feature for now.
In [162]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Cabin_Location_s', cmap=custom_cmap)
No description has been provided for this image
Pclass x Deck_bin¶
  • P1 Survival rates similar across (B, D, and E) and (A & M)
  • All other sample sizes across non-M deck P2/P3 classes are too small (n < 10)
  • Next Step: Chi-squared test survival association with Pclass x Deck_bin feature.
In [165]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Deck', cmap=custom_cmap)
No description has been provided for this image
In [166]:
def create_feature_Deck_bin(train_df, test_df):
    """
    Creates a binned deck feature 'Deck_bin' grouping decks with similar survival profiles:
    - AM:   Decks A, M
    - BDE:  Decks B, D, E
    - C:    Deck C
    - Rare: Decks F, G, T (low sample sizes)
    """
    access_map = {
        'A': 'AM',
        'M': 'AM',
        'B': 'BDE',
        'C': 'C',
        'D': 'BDE',
        'E': 'BDE',
        'F': 'Rare',
        'G': 'Rare',
        'T': 'Rare'
    }

    for df in [train_df, test_df]:
        df['Deck_bin'] = df['Deck'].map(access_map).astype('category').astype(str)

create_feature_Deck_bin(prepared_train_df, prepared_test_df)
In [167]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Deck_bin', cmap=custom_cmap)
No description has been provided for this image
Sex x Deck_bin¶
  • Survival rates similar across non-M decks
  • M-deck survival rates redundant to Sex x HasCabin
  • Sex x Deck_bin also appears redundant to Sex x HasCabin
  • Next Step: Skip engineering feature for now.
In [170]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Deck', cmap=custom_cmap)
No description has been provided for this image
In [171]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Deck_bin', cmap=custom_cmap)
No description has been provided for this image
Parch x Deck_bin¶
  • Passengers traveling alone and those with families residing on decks B, D, E had distinctly higher survival rates.
  • Solo travelers also had a heightened survival rate on Deck C.
  • Next Step: Chi-squared test survival association with Parch x Deck_bin
In [174]:
plot_survival_heatmap(prepared_train_df, 'Parch', 'Deck_bin', cmap=custom_cmap)
No description has been provided for this image
SibSp x Deck_bin¶
  • Similar story to Parch
  • Survival rate for solo and family travelers significantly higher on Decks BDE and C.
  • Next Step: Chi-squared test survival association with SibSp x Deck_bin
In [177]:
plot_survival_heatmap(prepared_train_df, 'SibSp', 'Deck_bin', cmap=custom_cmap)
No description has been provided for this image
Deck x Cabin_Location_s¶
  • Sample sizes across Deck x Cabin_Location_s combinations are not large enough to be reliable.
  • Next Step: Skip engineering feature for now.
In [180]:
plot_survival_heatmap(prepared_train_df, 'Deck', 'Cabin_Location_s', cmap=custom_cmap)
No description has been provided for this image
Pclass x Title_bin¶
  • Largely redundant to Pclass x Sex, with smaller sample sizes
  • Provides distinct survival rates between Master and Mr titles supported by n > 20.
  • Next Step: Chi-squared test survival association with Pclass x Title_bin
In [183]:
def create_feature_Title_bin(train_df, test_df):
    def bin_title(title):
        if title in ['Mr.', 'Don.', 'Sir.', 'Jonkheer.']:
            return 'Mr'
        elif title == 'Master.':
            return 'Master'
        elif title in ['Mrs.', 'Mme.', 'Lady.', 'Countess.']:
            return 'Mrs'
        elif title in ['Miss.', 'Ms.', 'Mlle.']:
            return 'Miss'
        else:
            return 'Other'

    for df in [train_df, test_df]:
        df['Title_bin'] = df['Title'].apply(bin_title).astype('category').astype(str)

create_feature_Title_bin(prepared_train_df, prepared_test_df)
In [184]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Title_bin', cmap=custom_cmap)
No description has been provided for this image
Sex x Title_bin¶
  • Provides distinct survival rates between Master/Mr and Miss/Mrs
  • Next Step: Chi-squared test survival association with Sex x Title_bin.
In [187]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Title_bin', cmap=custom_cmap)
No description has been provided for this image
Pclass x Age_Group¶
  • Provides distinct survival rates across classes and age groups.
  • Next Step: Chi-squared test survival association with Pclass x Age_Group.
In [190]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Age_Group', cmap=custom_cmap)
No description has been provided for this image
Sex x Age_Group¶
  • Provides distinct survival rates across sexes and age groups.
  • Next Step: Chi-squared test survival association with Sex x Age_Group.
In [193]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Age_Group', cmap=custom_cmap)
No description has been provided for this image
Pclass x FPP_log_bin¶
  • Relatively small variance in survival rates across fare bins within each ticket class
  • P2 passenger survival increased 6% from Fare Bin 4->5
  • Next Step: Chi-squared test survival association.
In [196]:
def create_feature_Fare_per_person_log(train_df, test_df, target_col='Survived', n_splits=5, random_state=42):
    """
    Creates Fare_Per_Person, Fare_Per_Person_log, and Fare_Per_Person_log_bin features.
    This version prevents data leakage:
    - For train_df: Fare per person is computed out-of-fold (ticket counts from K-1 folds only).
    - For test_df: Fare per person is computed using ticket counts from full training data.
    - All log features are cast to float32 to maintain dtype consistency.

    Args:
        train_df (DataFrame): Training dataset containing 'Fare' and 'Ticket' columns.
        test_df (DataFrame): Test dataset containing 'Fare' and 'Ticket' columns.
        target_col (str): Column used for stratification (default: 'Survived').
        n_splits (int): Number of Stratified K-Folds.
        random_state (int): Random seed for reproducibility.

    Returns:
        None. Modifies train_df and test_df in-place.
    """
    oof_fare_per_person_log = pd.Series(index=train_df.index, dtype=np.float32)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    for train_idx, val_idx in skf.split(train_df, train_df[target_col]):
        fold_train = train_df.iloc[train_idx]
        fold_val = train_df.iloc[val_idx]

        fold_ticket_counts = fold_train['Ticket'].value_counts()

        # Use the positionally-selected validation fold (avoids mixing label-based .loc
        # with positional fold indices, which breaks if the index is not a RangeIndex)
        val_ticket_counts = fold_val['Ticket'].map(fold_ticket_counts).fillna(1)
        fare_per_person = fold_val['Fare'] / val_ticket_counts
        oof_fare_per_person_log.iloc[val_idx] = np.log1p(fare_per_person).astype(np.float32).values

    train_df['Fare_Per_Person_log'] = oof_fare_per_person_log
    train_df['Fare_Per_Person'] = np.expm1(train_df['Fare_Per_Person_log']).astype(np.float32)

    # Test set: use full train ticket counts
    full_ticket_counts = train_df['Ticket'].value_counts()
    test_ticket_counts = test_df['Ticket'].map(full_ticket_counts).fillna(1)
    test_df['Fare_Per_Person'] = (test_df['Fare'] / test_ticket_counts).astype(np.float32)
    test_df['Fare_Per_Person_log'] = np.log1p(test_df['Fare_Per_Person']).astype(np.float32)

    # Binning based on training data distribution
    bins = np.quantile(train_df['Fare_Per_Person_log'], q=np.linspace(0, 1, 6))
    bins[0] = -np.inf
    bins[-1] = np.inf
    train_df['Fare_Per_Person_log_bin'] = pd.cut(train_df['Fare_Per_Person_log'], bins=bins, labels=False)
    test_df['Fare_Per_Person_log_bin'] = pd.cut(test_df['Fare_Per_Person_log'], bins=bins, labels=False)

    print("✅ Fare_Per_Person_log and bins added to train and test sets (leakage prevented, dtype safe).")

        
create_feature_Fare_per_person_log(prepared_train_df, prepared_test_df)
✅ Fare_Per_Person_log and bins added to train and test sets (leakage prevented, dtype safe).
In [197]:
def create_feature_Fare_per_person_log_bin(train_df, test_df):
    """
    Creates "Fare_Per_Person", "Fare_Per_Person_log", and "Fare_Per_Person_log_bin" features.
    Applies quantile-based binning to "Fare_Per_Person_log" using training data to minimize distribution shift.
    Bin edges are expanded to include ±infinity to ensure all test values map to valid bins.

    Args:
        train_df (DataFrame): Training data set
        test_df (DataFrame): Test data set

    Returns:
        Nothing
    """
    # Calculate ticket counts from training data
    training_ticket_counts = train_df['Ticket'].value_counts()

    for df in [train_df, test_df]:
        ticket_counts = df['Ticket'].map(training_ticket_counts).fillna(1)
        df['Fare_Per_Person'] = df['Fare'] / ticket_counts
        df['Fare_Per_Person_log'] = np.log1p(df['Fare_Per_Person'])

    # Compute quantile-based bins from training set
    qcut_bins = pd.qcut(train_df['Fare_Per_Person_log'], q=6, duplicates='drop', retbins=True)[1]
    qcut_bins[0] = -np.inf  # Extend first bin edge to -inf
    qcut_bins[-1] = np.inf  # Extend last bin edge to +inf

    # Create bin labels
    bin_labels = [f"Bin {i+1}" for i in range(len(qcut_bins) - 1)]
    cat_dtype = pd.api.types.CategoricalDtype(categories=bin_labels, ordered=True)

    for df in [train_df, test_df]:
        df['FPP_log_bin'] = pd.cut(
            df['Fare_Per_Person_log'],
            bins=qcut_bins,
            labels=bin_labels,
            include_lowest=True
        ).astype(cat_dtype)
        
create_feature_Fare_per_person_log_bin(prepared_train_df, prepared_test_df)
In [198]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'FPP_log_bin', cmap=custom_cmap)
No description has been provided for this image
Sex x FPP_log_bin¶
  • Reveals lower survival probabilities for females across FPP Bins 1 through 4.
  • Next Step: Chi-squared test survival association with Sex x FPP_log_bin.
In [201]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'FPP_log_bin', cmap=custom_cmap)
No description has been provided for this image
Pclass x Parch_SibSp¶
  • Distinct survival rates across ticket classes and Parch_SibSp 0->2
  • Next Steps: Chi-squared test survival association with Pclass x Parch_SibSp; Bin Parch_SibSp >= 4
In [204]:
def create_feature_Parch_SibSp(train_df, test_df):
    for df in [train_df, test_df]:
        df['Parch_SibSp'] = df['Parch'] + df['SibSp']

create_feature_Parch_SibSp(prepared_train_df, prepared_test_df)
In [205]:
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Parch_SibSp', cmap=custom_cmap)
No description has been provided for this image
Sex x Parch_SibSp¶
  • Distinct survival rates across sexes and Parch_SibSp 0->3
  • Next Step: Chi-squared test survival association with Sex x Parch_SibSp
In [208]:
plot_survival_heatmap(prepared_train_df, 'Sex', 'Parch_SibSp', cmap=custom_cmap)
No description has been provided for this image
HasCabin x Parch_SibSp¶
  • Survival rate for passengers without cabin designations increased steadily from Parch_SibSp 0->3
  • For passengers with cabin designations, survival rate increased more sharply from 0->1, then stayed similar
  • Next Step: Bin Parch_SibSp at 2+ or 3+ and chi-squared test survival association
In [211]:
plot_survival_heatmap(prepared_train_df, 'HasCabin', 'Parch_SibSp', cmap=custom_cmap)
No description has been provided for this image

High-Cardinality Features¶

Ticket¶
  • We can assume that passengers sharing the same Ticket are part of the same traveling group.
  • Ideas for Feature Engineering:
    • Create a "ticket frequency" feature that counts the number of training fold training set passengers that share the same Ticket values and map that frequency number to passengers in the test data set that share the same combination of feature values.
In [215]:
ticket_counts = prepared_train_df['Ticket'].value_counts()

shared_counts = (
    ticket_counts
    .value_counts()
    .rename_axis('Ticket_Frequency')
    .reset_index(name='Passenger_Count')
    .sort_values('Ticket_Frequency')
)

print(shared_counts)
   Ticket_Frequency  Passenger_Count
0                 1              547
1                 2               94
2                 3               21
3                 4               11
6                 5                2
5                 6                3
4                 7                3
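The ticket-frequency idea above can be sketched as follows. This simple version uses full-train counts for both sets (as the notebook does for the test set in `create_feature_Fare_per_person_log`); an out-of-fold variant would be needed for leakage-safe training features. Function and column names here are hypothetical:

```python
import pandas as pd

def add_ticket_frequency(train_df, test_df):
    """Map train-set ticket counts onto both sets; tickets unseen in training default to 1."""
    counts = train_df['Ticket'].value_counts()
    for df in (train_df, test_df):
        df['Ticket_Frequency'] = df['Ticket'].map(counts).fillna(1).astype(int)

# Toy example (invented tickets)
train = pd.DataFrame({'Ticket': ['A', 'A', 'B', 'C', 'C', 'C']})
test = pd.DataFrame({'Ticket': ['A', 'D']})
add_ticket_frequency(train, test)
print(train['Ticket_Frequency'].tolist())  # [2, 2, 1, 3, 3, 3]
print(test['Ticket_Frequency'].tolist())   # [2, 1]
```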

Feature Priority Based on EDA¶

  • The following list is ordered by most potentially valuable insights / predictive power first.
  • Given the long list and limited timeframe, a subset of identified features will be prioritized and iterated upon during Model Development phase.
| Feature | Domain Insights | Stats Notes |
| --- | --- | --- |
| Pclass x Sex | Females: 1st/2nd class +90% survival, 3rd: 50%; Males: survival decreases from 37% (1st) to 14% (3rd) | Clear survival differences between groups; high sample counts support further grouping |
| Pclass x Age_Group | Distinct survival rates across classes and ages | Sample sizes borderline low (n ~ 10) |
| Pclass x HasCabin | Having a cabin designation increased survival across all classes (range: 0.50 (P3) to 0.81 (P2)) | Low sample sizes for P2/P3 HasCabin=True (n < 16) |
| Sex x HasCabin | Having a cabin designation boosted survival ~30% for both sexes | Well supported across all 4 groups |
| Embarked x HasCabin | Having a cabin designation boosted survival 10%-30% across the embarkation points | Low samples for Queenstown/HasCabin=True |
| HasCabin x Parch_SibSp | Survival rate increased at differing rates between have/have-not for 0->3 | Low samples for 3+ (n < 20) |
| Pclass x Parch_SibSp | Survival rate increased across classes for family sizes 0->3 | n < 20 for Parch_SibSp 2+ |
| Sex x Parch_SibSp | Survival rate increased across sexes for family sizes 0->3 | n < 20 for Parch_SibSp 3+ |
| Pclass x Embarked | Southampton 3rd class perished most (0.19, n=353); Cherbourg 1st class perished least (0.69, n=85) | Smooth feature to account for sparse categories |
| Sex x Embarked | Queenstown males had lowest survival (0.07, n=41); Cherbourg females had highest (0.88, n=73) | Well supported (n > 30) across groups |
| Pclass x Deck_bin | P1 survival rates similar for Decks BDE and AM; BDE decks had highest survival (~75%) | Low samples for non-M, non-P1 groups (n < 10) |
| Pclass x Cabin_Location_s | P1 passengers on the starboard side had 10% higher survival than port | Low samples (n < 10) for other port/starboard classes |
| Pclass x Title | 1st/2nd class young boys (Master) had 100% survival | Sample size only 12; start with smoothed feature but may need to replace with a boolean |
| Sex x FPP_log_bin | Lower survival probabilities for females across FPP Bins 1 through 4 (42% -> 74%) | Groups well supported |

Cross-Fold Distribution Shift Analysis¶

  • For each variable, and features-to-be engineered, average Kullback-Leibler (KL) divergence is calculated between training and validation folds of the training set to quantify distribution shift and assess generalizability to unseen data.
  • KL divergence calculated from training set data only to mitigate data leakage.
  • Average KL divergence is calculated from the KL divergences between K-1 training folds and 1 "unseen" validation fold for 5 cross-validation iterations.
  • Threshold for significant Cross-Fold (CF) Distribution Shift is KL >= 0.02
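As a toy illustration of that threshold (the two distributions are invented for the example), `scipy.stats.entropy` with two arguments computes KL divergence directly:

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical category distributions for one feature: a train fold vs. its validation fold
train_dist = np.array([0.50, 0.30, 0.20])
val_dist   = np.array([0.45, 0.35, 0.20])

kl = entropy(train_dist, val_dist)  # KL(train || val), in nats
print(f"KL = {kl:.4f}")
```

Here the KL value comes out well under 0.02, so a shift this small would be treated as negligible.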
In [221]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from scipy.stats import entropy

def kl_divergence(p, q, smooth=1e-6):
    """KL divergence using scipy.stats.entropy with smoothing."""
    p = np.asarray(p) + smooth
    q = np.asarray(q) + smooth
    return entropy(p, q)

def evaluate_feature_kl_divergence(
    df,
    feature_list,
    target_col='Survived',
    n_splits=5,
    random_state=42,
    auto_bin_strategy='quantile',  # or 'uniform'
    n_bins=10
):
    """
    Evaluate KL divergence between train and val distributions for categorical or continuous features.

    Args:
        df (pd.DataFrame): Input DataFrame.
        feature_list (list): List of feature names (str or list/tuple of two features).
        target_col (str): Target column for stratification.
        n_splits (int): Number of cross-validation folds.
        random_state (int): Random seed.
        auto_bin_strategy (str): 'quantile' or 'uniform' binning for continuous variables.
        n_bins (int): Number of bins if binning is applied.

    Returns:
        styled (pd.io.formats.style.Styler): Highlighted summary table.
        df_results (pd.DataFrame): Full results table.
    """
    results = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    def compute_bin_edges(series):
        if auto_bin_strategy == 'quantile':
            return np.unique(np.quantile(series.dropna(), np.linspace(0, 1, n_bins + 1)))
        elif auto_bin_strategy == 'uniform':
            return np.linspace(series.min(), series.max(), n_bins + 1)
        else:
            raise ValueError("Invalid auto_bin_strategy: choose 'quantile' or 'uniform'")

    def bin_if_continuous(series, bin_edges):
        return pd.cut(series, bins=bin_edges, include_lowest=True)

    for feature in feature_list:
        kl_values = []

        for train_idx, val_idx in skf.split(df, df[target_col]):
            fold_train = df.iloc[train_idx].copy()
            fold_val = df.iloc[val_idx].copy()

            if isinstance(feature, (list, tuple)):
                feat_name = " x ".join(feature)
                for col in feature:
                    if pd.api.types.is_numeric_dtype(fold_train[col]) and fold_train[col].nunique() > 25:
                        print(f"Binning {feature}: {fold_train[feature].nunique()} unique values")
                        edges = compute_bin_edges(fold_train[col])
                        fold_train[col] = bin_if_continuous(fold_train[col], edges)
                        fold_val[col] = bin_if_continuous(fold_val[col], edges)
                train_counts = fold_train.groupby(list(feature), observed=True).size()
                val_counts = fold_val.groupby(list(feature), observed=True).size()
            else:
                feat_name = feature
                if pd.api.types.is_numeric_dtype(fold_train[feature]) and fold_train[feature].nunique() > 25:
                    edges = compute_bin_edges(fold_train[feature])
                    fold_train[feature] = bin_if_continuous(fold_train[feature], edges)
                    fold_val[feature] = bin_if_continuous(fold_val[feature], edges)
                train_counts = fold_train[feature].value_counts()
                val_counts = fold_val[feature].value_counts()

            # Normalize to distributions
            train_dist = train_counts / train_counts.sum()
            val_dist = val_counts / val_counts.sum()

            # Align keys
            all_keys = train_dist.index.union(val_dist.index)
            train_dist = train_dist.reindex(all_keys, fill_value=0)
            val_dist = val_dist.reindex(all_keys, fill_value=0)

            kl = kl_divergence(train_dist, val_dist)
            kl_values.append(kl)

        result = {
            'Feature': feat_name,
            'Avg_KL_Divergence': np.mean(kl_values),
            'Min_KL_Divergence': np.min(kl_values),
            'Max_KL_Divergence': np.max(kl_values),
            'Std_KL_Divergence': np.std(kl_values)
        }

        for i, val in enumerate(kl_values):
            result[f'Fold_{i+1}_KL'] = val

        results.append(result)

    df_results = pd.DataFrame(results).sort_values(by='Avg_KL_Divergence', ascending=False)

    def highlight_kl(s):
        return ['background-color: yellow' if v >= 0.02 else '' for v in s]

    styled = df_results.style.apply(highlight_kl, subset=['Avg_KL_Divergence'])
    return styled, df_results
In [222]:
features_to_evaluate = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'HasCabin', 'Cabin_count', 'Cabin_Location_s',
                        'Deck', 'Title', 'Age', 'Age_Group', 'Fare', 'FPP_log_bin', 'Parch_SibSp', ['Pclass', 'Sex'], ['Pclass', 'Title'], ['Pclass', 'Parch'],
                        ['Pclass', 'SibSp'], ['Sex', 'Parch'], ['Sex', 'SibSp'], ['Pclass', 'Embarked'], ['Sex', 'Embarked'], 
                        ['Pclass', 'HasCabin'], ['SibSp', 'HasCabin'], ['Parch', 'HasCabin'],
                        ['Embarked', 'HasCabin'], ['Pclass', 'Cabin_count'], ['Sex', 'Cabin_count'], ['Pclass', 'Cabin_Location_s'],
                        ['Sex', 'Cabin_Location_s'], ['Pclass', 'Deck_bin'], ['Sex', 'Deck_bin'], ['Parch', 'Deck_bin'], ['SibSp', 'Deck_bin'],
                        ['Deck', 'Cabin_Location_s'], ['Pclass', 'Title_bin'], ['Sex', 'Title_bin'], ['Pclass', 'Age_Group'], ['Sex', 'Age_Group'],
                        ['Pclass', 'FPP_log_bin'], ['Sex', 'FPP_log_bin'], ['Pclass', 'Parch_SibSp'], ['Sex', 'Parch_SibSp']]
In [223]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, features_to_evaluate)
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
16 Pclass x Title 0.233912 0.151928 0.319446 0.053121 0.151928 0.226165 0.237274 0.319446 0.234747
34 Parch x Deck_bin 0.189282 0.087900 0.319458 0.092661 0.138525 0.280541 0.119985 0.319458 0.087900
44 Sex x Parch_SibSp 0.181378 0.048391 0.339622 0.110240 0.048391 0.259522 0.187058 0.339622 0.072297
9 Title 0.176547 0.093651 0.285688 0.064112 0.093651 0.191216 0.137166 0.285688 0.175013
43 Pclass x Parch_SibSp 0.165862 0.070471 0.253850 0.076717 0.077618 0.224291 0.070471 0.253850 0.203078
37 Pclass x Title_bin 0.156349 0.064196 0.266336 0.066432 0.064196 0.122950 0.175081 0.266336 0.153180
36 Deck x Cabin_Location_s 0.138029 0.088532 0.184896 0.035618 0.088532 0.171475 0.129813 0.184896 0.115430
30 Pclass x Cabin_Location_s 0.124354 0.060462 0.187865 0.042098 0.060462 0.136060 0.135184 0.102198 0.187865
20 Sex x SibSp 0.120758 0.020390 0.188893 0.067905 0.063143 0.188893 0.143508 0.187856 0.020390
18 Pclass x SibSp 0.113705 0.071795 0.211615 0.050058 0.086212 0.095968 0.102938 0.211615 0.071795
17 Pclass x Parch 0.108081 0.058864 0.159577 0.033182 0.119175 0.159577 0.111851 0.058864 0.090935
41 Pclass x FPP_log_bin 0.103770 0.006048 0.200233 0.065205 0.076166 0.093450 0.200233 0.142951 0.006048
35 SibSp x Deck_bin 0.103751 0.063224 0.205703 0.051705 0.079703 0.090234 0.079892 0.205703 0.063224
19 Sex x Parch 0.101998 0.071684 0.123776 0.022449 0.115695 0.123776 0.120930 0.071684 0.077904
32 Pclass x Deck_bin 0.097972 0.021305 0.156873 0.058168 0.036229 0.021305 0.156873 0.120449 0.155007
39 Pclass x Age_Group 0.092701 0.029594 0.189176 0.057705 0.120718 0.189176 0.080586 0.043430 0.029594
25 Parch x HasCabin 0.086801 0.033732 0.127121 0.036505 0.122509 0.127121 0.093794 0.033732 0.056850
28 Pclass x Cabin_count 0.080167 0.032598 0.210130 0.065470 0.050717 0.032598 0.055862 0.051528 0.210130
3 Parch 0.070782 0.014046 0.120372 0.035046 0.120372 0.083354 0.080111 0.014046 0.056025
26 SibSp x HasCabin 0.069913 0.025272 0.102681 0.026465 0.074920 0.025272 0.087253 0.102681 0.059437
29 Sex x Cabin_count 0.066488 0.028877 0.110650 0.027259 0.070893 0.110650 0.049367 0.028877 0.072651
38 Sex x Title_bin 0.065952 0.018694 0.238731 0.086440 0.022777 0.027864 0.018694 0.238731 0.021693
14 Parch_SibSp 0.062126 0.010545 0.171704 0.061770 0.011865 0.171704 0.010545 0.088703 0.027815
8 Deck 0.050886 0.023459 0.077066 0.020708 0.023459 0.056705 0.030404 0.077066 0.066794
21 Pclass x Embarked 0.050665 0.017766 0.071438 0.020350 0.036846 0.059818 0.071438 0.067456 0.017766
31 Sex x Cabin_Location_s 0.049701 0.026488 0.063176 0.012454 0.048990 0.063176 0.055036 0.026488 0.054816
23 Pclass x HasCabin 0.045572 0.005157 0.151005 0.054150 0.030186 0.005157 0.035636 0.005874 0.151005
33 Sex x Deck_bin 0.038423 0.010414 0.128073 0.044959 0.018159 0.010414 0.014698 0.128073 0.020769
42 Sex x FPP_log_bin 0.037669 0.007717 0.077980 0.024528 0.077980 0.051409 0.027482 0.023759 0.007717
40 Sex x Age_Group 0.037536 0.015729 0.066647 0.023436 0.018944 0.066647 0.065694 0.015729 0.020665
12 Fare 0.035025 0.018682 0.067847 0.017375 0.067847 0.036000 0.023872 0.018682 0.028724
7 Cabin_Location_s 0.035010 0.019988 0.062994 0.018240 0.020678 0.062994 0.020721 0.019988 0.050667
6 Cabin_count 0.029722 0.005008 0.073406 0.023047 0.023223 0.026385 0.020590 0.005008 0.073406
27 Embarked x HasCabin 0.028635 0.010627 0.050821 0.016619 0.014879 0.050821 0.046258 0.010627 0.020591
10 Age 0.027167 0.006812 0.036574 0.010653 0.036574 0.027190 0.034200 0.006812 0.031057
2 SibSp 0.023899 0.003725 0.070919 0.024648 0.004250 0.018335 0.022264 0.070919 0.003725
13 FPP_log_bin 0.020249 0.001980 0.046791 0.016337 0.046791 0.029615 0.016580 0.006279 0.001980
22 Sex x Embarked 0.018854 0.011960 0.029482 0.006015 0.017022 0.011960 0.029482 0.020665 0.015141
11 Age_Group 0.010305 0.003488 0.015674 0.004456 0.014158 0.015674 0.010868 0.003488 0.007339
15 Pclass x Sex 0.009868 0.004667 0.021669 0.006045 0.004667 0.008709 0.021669 0.007522 0.006772
4 Embarked 0.007985 0.003290 0.013197 0.003653 0.013197 0.004446 0.003290 0.009964 0.009027
0 Pclass 0.004215 0.000181 0.011979 0.004194 0.004356 0.003669 0.011979 0.000181 0.000889
1 Sex 0.003707 0.000391 0.008590 0.002726 0.000391 0.002468 0.008590 0.002939 0.004146
5 HasCabin 0.000622 0.000080 0.002048 0.000730 0.000552 0.000214 0.000214 0.000080 0.002048

Feature Engineering¶

Reduce Distribution Shift of Select Features¶

Pclass x Age_Group¶
  • Given the large differences in Age distributions between ticket classes, I'm forgoing a feature that tries to identify similarities across them.
  • See Pclass x Sex x Age_Group for Age_Group-based feature.
In [228]:
sns.kdeplot(data=prepared_train_df, x="Age", hue="Pclass")
plt.title("Large Age Distribution Differences between Ticket Classes")
plt.show()
No description has been provided for this image
Pclass_HasCabin¶

Creating Pclass_HasCabin with groups of n < 30 binned to 'Rare' reduced KL to acceptable levels (KL = 0.015919).

In [231]:
def create_feature_Pclass_HasCabin(train_df, test_df):
    """
    Creates a composite feature 'Pclass_HasCabin' by combining:
    - Pclass (1, 2, or 3)
    - HasCabin (converted to 0/1)

    Any combination with fewer than 30 samples in the training set is binned into 'Rare'.

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_HasCabin' column to both dataframes)
    """
    def combine(pclass, has_cabin):
        return f"{pclass}_{int(has_cabin)}"

    # Generate raw composite keys
    train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['HasCabin']), axis=1)
    
    # Count frequencies and find common groups
    value_counts = train_keys.value_counts()
    common_groups = value_counts[value_counts >= 30].index

    def assign_or_rare(pclass, has_cabin):
        key = f"{pclass}_{int(has_cabin)}"
        return key if key in common_groups else "Rare"

    for df in [train_df, test_df]:
        df['Pclass_HasCabin'] = df.apply(lambda row: assign_or_rare(row['Pclass'], row['HasCabin']), axis=1).astype(str)

    print("Created 'Pclass_HasCabin' in train_df and test_df (groups < 30 binned to 'Rare').")


create_feature_Pclass_HasCabin(prepared_train_df, prepared_test_df)
Created 'Pclass_HasCabin' in train_df and test_df (groups < 30 binned to 'Rare').
In [232]:
prepared_train_df['Pclass_HasCabin'].value_counts()
Out[232]:
Pclass_HasCabin
3_0     479
1_1     176
2_0     168
1_0      40
Rare     28
Name: count, dtype: int64
In [233]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_HasCabin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_HasCabin 0.015919 0.005076 0.035123 0.013076 0.028302 0.005076 0.035123 0.005405 0.005687
Sex x HasCabin¶

Created Sex_HasCabin; exhibited negligible CF distribution shift (KL = 0.005846).

In [236]:
def create_feature_Sex_HasCabin(train_df, test_df):
    """
    Creates a composite feature 'Sex_HasCabin' by combining:
    - Sex (e.g., 'male' or 'female')
    - HasCabin (converted to 0 or 1)

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Sex_HasCabin' column to both dataframes)
    """
    def combine(sex, has_cabin):
        return f"{sex}_{int(has_cabin)}"

    for df in [train_df, test_df]:
        df['Sex_HasCabin'] = df.apply(lambda row: combine(row['Sex'], row['HasCabin']), axis=1).astype(str)

    print("Created 'Sex_HasCabin' in train_df and test_df.")

create_feature_Sex_HasCabin(prepared_train_df, prepared_test_df)
Created 'Sex_HasCabin' in train_df and test_df.
In [237]:
prepared_train_df['Sex_HasCabin'].value_counts()
Out[237]:
Sex_HasCabin
male_0      470
female_0    217
male_1      107
female_1     97
Name: count, dtype: int64
In [238]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_HasCabin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Sex_HasCabin 0.005846 0.003257 0.009534 0.002385 0.003257 0.004473 0.009534 0.004221 0.007743
Embarked x HasCabin¶

Created Embarked_HasCabin. It exhibited CF distribution shift (KL = 0.028635) due to the rare Q_1 group. Keeping it, since the rare group will be excluded during smoothed feature translation.
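The smoothed feature translation referenced here is not shown in this section; one common form is m-estimate target encoding, which shrinks each category's survival rate toward the global mean. A minimal sketch under that assumption (the function name and `m` value are placeholders, not the notebook's actual implementation):

```python
import pandas as pd

def smoothed_survival_rate(train_df, col, target='Survived', m=10):
    """Per-category target rate, shrunk toward the global mean; m controls the shrinkage."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col, observed=True)[target].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Toy example: the rare category ('Q_1') is pulled strongly toward the global mean
toy = pd.DataFrame({
    'Embarked_HasCabin': ['S_0'] * 8 + ['Q_1'] * 2,
    'Survived':          [0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
})
rates = smoothed_survival_rate(toy, 'Embarked_HasCabin')
print(rates)  # 'Q_1' raw rate 1.0 shrinks to 0.5; 'S_0' barely moves
```

Rare groups like Q_1 (n=4 in training) are exactly where this shrinkage matters, since their raw rates are dominated by noise.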

In [241]:
def create_feature_Embarked_HasCabin(train_df, test_df):
    """
    Creates a composite feature 'Embarked_HasCabin' by combining:
    - Embarked (C, Q, S)
    - HasCabin (as 0 or 1)

    Example values: 'S_1', 'C_0'

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Embarked_HasCabin' column to both dataframes)
    """
    def combine(embarked, has_cabin):
        return f"{embarked}_{int(has_cabin)}"

    for df in [train_df, test_df]:
        df['Embarked_HasCabin'] = df.apply(
            lambda row: combine(row['Embarked'], row['HasCabin']), axis=1
        ).astype(str)

    print("Created 'Embarked_HasCabin' in train_df and test_df.")

create_feature_Embarked_HasCabin(prepared_train_df, prepared_test_df)
Created 'Embarked_HasCabin' in train_df and test_df.
In [242]:
prepared_train_df['Embarked_HasCabin'].value_counts()
Out[242]:
Embarked_HasCabin
S_0    515
S_1    131
C_0     99
Q_0     73
C_1     69
Q_1      4
Name: count, dtype: int64
In [243]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Embarked_HasCabin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Embarked_HasCabin 0.028635 0.010627 0.050821 0.016619 0.014879 0.050821 0.046258 0.010627 0.020591
Parch_SibSp_bin¶

Creating new Parch_SibSp_bin reduced Cross-Fold (CF) Distribution Shift to negligible levels (KL = 0.011754).

In [246]:
def create_feature_Parch_SibSp_bin(train_df, test_df):
    """
    Creates a binned version of the 'Parch_SibSp' feature:
    - Values >= 4 → '4+'
    - All other values are converted to strings of their actual value

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Parch_SibSp_bin' column to both dataframes)
    """
    def bin_value(x):
        return '4+' if x >= 4 else str(x)

    for df in [train_df, test_df]:
        df['Parch_SibSp_bin'] = df['Parch_SibSp'].apply(bin_value).astype(str)

    print("Created 'Parch_SibSp_bin' in train_df and test_df with '4+' bin for values >= 4.")

create_feature_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Parch_SibSp_bin' in train_df and test_df with '4+' bin for values >= 4.
In [247]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Parch_SibSp_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Parch_SibSp_bin 0.011754 0.006239 0.020432 0.004932 0.006239 0.020432 0.008108 0.010876 0.013114
HasCabin x Parch_SibSp_bin¶

Created HasCabin_Parch_SibSp_bin, binning Parch_SibSp at 3+; exhibited negligible CF distribution shift (KL = 0.019378).

In [250]:
def create_feature_HasCabin_Parch_SibSp_bin(train_df, test_df):
    """
    Creates a composite feature 'HasCabin_Parch_SibSp_bin' by:
    - Summing Parch + SibSp
    - Binning the result into '0', '1', '2', or '3+'
    - Concatenating with HasCabin (converted to 0 or 1)

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'HasCabin_Parch_SibSp_bin' to both dataframes)
    """
    def bin_family_size(n):
        if n == 0:
            return '0'
        elif n == 1:
            return '1'
        elif n == 2:
            return '2'
        else:
            return '3+'

    def combine(has_cabin, parch, sibsp):
        family_size_bin = bin_family_size(parch + sibsp)
        return f"{int(has_cabin)}_{family_size_bin}"

    for df in [train_df, test_df]:
        df['HasCabin_Parch_SibSp_bin'] = df.apply(
            lambda row: combine(row['HasCabin'], row['Parch'], row['SibSp']), axis=1
        ).astype(str)

    print("Created 'HasCabin_Parch_SibSp_bin' in train_df and test_df.")

create_feature_HasCabin_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'HasCabin_Parch_SibSp_bin' in train_df and test_df.
In [251]:
prepared_train_df['HasCabin_Parch_SibSp_bin'].value_counts()
Out[251]:
HasCabin_Parch_SibSp_bin
0_0     443
0_1      95
1_0      94
0_3+     76
0_2      73
1_1      66
1_2      29
1_3+     15
Name: count, dtype: int64
In [252]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['HasCabin_Parch_SibSp_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 HasCabin_Parch_SibSp_bin 0.019378 0.006888 0.029617 0.008861 0.013843 0.029221 0.006888 0.029617 0.017321
Pclass x Parch_SibSp_bin¶

Created Pclass_Parch_SibSp_bin, binning Parch_SibSp at 1+; exhibited negligible CF distribution shift (KL = 0.013836).

In [255]:
def create_feature_Pclass_Parch_SibSp_bin(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Parch_SibSp_bin' by:
    - Summing Parch + SibSp
    - Binning the sum into '0' or '1+'
    - Concatenating with Pclass

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Parch_SibSp_bin' to both dataframes)
    """
    def bin_family_size(n):
        return '0' if n == 0 else '1+'

    def combine(pclass, parch, sibsp):
        family_size_bin = bin_family_size(parch + sibsp)
        return f"{pclass}_{family_size_bin}"

    for df in [train_df, test_df]:
        df['Pclass_Parch_SibSp_bin'] = df.apply(
            lambda row: combine(row['Pclass'], row['Parch'], row['SibSp']), axis=1
        ).astype(str)

    print("Created 'Pclass_Parch_SibSp_bin' in train_df and test_df.")

create_feature_Pclass_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Pclass_Parch_SibSp_bin' in train_df and test_df.
In [256]:
prepared_train_df['Pclass_Parch_SibSp_bin'].value_counts()
Out[256]:
Pclass_Parch_SibSp_bin
3_0     324
3_1+    167
1_0     109
1_1+    107
2_0     104
2_1+     80
Name: count, dtype: int64
In [257]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Parch_SibSp_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_Parch_SibSp_bin 0.013836 0.001541 0.024130 0.007736 0.013611 0.019396 0.024130 0.010499 0.001541
In [258]:
survival_df = (
    prepared_train_df
    .groupby("Pclass_Parch_SibSp_bin", observed=True)
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size')
        )
    .reset_index()
    .sort_values(by="Pclass_Parch_SibSp_bin", ascending=True)
)
survival_df
Out[258]:
Pclass_Parch_SibSp_bin Survival_Rate Count
0 1_0 0.532110 109
1 1_1+ 0.728972 107
2 2_0 0.346154 104
3 2_1+ 0.637500 80
4 3_0 0.212963 324
5 3_1+ 0.299401 167
Sex x Parch_SibSp_bin¶

Created Sex_Parch_SibSp_bin, binning the Parch + SibSp sum into '0' and '1+' before concatenating with Sex; exhibited negligible CF distribution shift (KL = 0.015822).

In [261]:
def create_feature_Sex_Parch_SibSp_bin(train_df, test_df):
    """
    Creates a composite feature 'Sex_Parch_SibSp_bin' by:
    - Summing Parch + SibSp
    - Binning the sum into '0' or '1+'
    - Concatenating with Sex (e.g., 'male_1+', 'female_0')

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Sex_Parch_SibSp_bin' to both dataframes)
    """
    def bin_family_size(n):
        return '0' if n == 0 else '1+'

    def combine(sex, parch, sibsp):
        family_size_bin = bin_family_size(parch + sibsp)
        return f"{sex}_{family_size_bin}"

    for df in [train_df, test_df]:
        df['Sex_Parch_SibSp_bin'] = df.apply(
            lambda row: combine(row['Sex'], row['Parch'], row['SibSp']), axis=1
        ).astype(str)

    print("Created 'Sex_Parch_SibSp_bin' in train_df and test_df.")

create_feature_Sex_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Sex_Parch_SibSp_bin' in train_df and test_df.
In [262]:
prepared_train_df['Sex_Parch_SibSp_bin'].value_counts()
Out[262]:
Sex_Parch_SibSp_bin
male_0       411
female_1+    188
male_1+      166
female_0     126
Name: count, dtype: int64
In [263]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_Parch_SibSp_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Sex_Parch_SibSp_bin 0.015822 0.000836 0.033512 0.012422 0.000836 0.033512 0.027176 0.007020 0.010564
In [264]:
survival_df = (
    prepared_train_df
    .groupby("Sex_Parch_SibSp_bin", observed=True)
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size')
        )
    .reset_index()
    .sort_values(by="Sex_Parch_SibSp_bin", ascending=True)
)
survival_df
Out[264]:
Sex_Parch_SibSp_bin Survival_Rate Count
0 female_0 0.785714 126
1 female_1+ 0.712766 188
2 male_0 0.155718 411
3 male_1+ 0.271084 166
Pclass x Embarked¶

Created (n < 20)-binned Pclass_Embarked - exhibited negligible CF distribution shift (KL = 0.018094).

In [267]:
def create_feature_Pclass_Embarked(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Embarked' by combining:
    - Pclass (as an integer)
    - Embarked (as a character)

    Any combination that appears fewer than 20 times in the training set
    is binned into the 'Rare' category.

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Embarked' column to both dataframes)
    """
    def combine(pclass, embarked):
        return f"{pclass}_{embarked}"

    # Build composite keys for the training set
    train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Embarked']), axis=1)

    # Count occurrences and identify common groups
    value_counts = train_keys.value_counts()
    common_groups = value_counts[value_counts >= 20].index

    def assign_or_rare(pclass, embarked):
        key = f"{pclass}_{embarked}"
        return key if key in common_groups else 'Rare'

    for df in [train_df, test_df]:
        df['Pclass_Embarked'] = df.apply(
            lambda row: assign_or_rare(row['Pclass'], row['Embarked']), axis=1
        ).astype(str)

    print("Created 'Pclass_Embarked' in train_df and test_df (groups < 20 binned to 'Rare').")

create_feature_Pclass_Embarked(prepared_train_df, prepared_test_df)
Created 'Pclass_Embarked' in train_df and test_df (groups < 20 binned to 'Rare').
In [268]:
prepared_train_df['Pclass_Embarked'].value_counts()
Out[268]:
Pclass_Embarked
3_S     353
2_S     164
1_S     129
1_C      85
3_Q      72
3_C      66
Rare     22
Name: count, dtype: int64
In [269]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Embarked'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_Embarked 0.018094 0.011404 0.029573 0.007008 0.029573 0.011629 0.015340 0.022525 0.011404
Sex x Embarked¶

Created (n < 30)-binned Sex_Embarked - exhibited negligible CF distribution shift (KL = 0.018854).

In [272]:
def create_feature_Sex_Embarked(train_df, test_df):
    """
    Creates a composite feature 'Sex_Embarked' by combining:
    - Sex (e.g., 'male', 'female')
    - Embarked (e.g., 'C', 'Q', 'S')

    Any combination that appears fewer than 30 times in the training set
    is binned into the 'Rare' category.

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Sex_Embarked' column to both dataframes)
    """
    def combine(sex, embarked):
        return f"{sex}_{embarked}"

    # Build composite keys in the training set
    train_keys = train_df.apply(lambda row: combine(row['Sex'], row['Embarked']), axis=1)

    # Count and identify common combinations
    value_counts = train_keys.value_counts()
    common_groups = value_counts[value_counts >= 30].index

    def assign_or_rare(sex, embarked):
        key = f"{sex}_{embarked}"
        return key if key in common_groups else 'Rare'

    for df in [train_df, test_df]:
        df['Sex_Embarked'] = df.apply(
            lambda row: assign_or_rare(row['Sex'], row['Embarked']), axis=1
        ).astype(str)

    print("Created 'Sex_Embarked' in train_df and test_df (groups < 30 binned to 'Rare').")

create_feature_Sex_Embarked(prepared_train_df, prepared_test_df)
Created 'Sex_Embarked' in train_df and test_df (groups < 30 binned to 'Rare').
In [273]:
prepared_train_df['Sex_Embarked'].value_counts()
Out[273]:
Sex_Embarked
male_S      441
female_S    205
male_C       95
female_C     73
male_Q       41
female_Q     36
Name: count, dtype: int64
In [274]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_Embarked'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Sex_Embarked 0.018854 0.011960 0.029482 0.006015 0.017022 0.011960 0.029482 0.020665 0.015141
Pclass x Deck_bin¶
  • Created (n < 10)-binned Pclass_Deck_bin - exhibited negligible CF distribution shift (KL = 0.013757).
  • Decks B/D/E and A/M were each binned together ('BDE', 'AM') due to similar survival rates observed in EDA.
In [277]:
def create_feature_Pclass_Deck_bin(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Deck_bin' by:
    - Binning Deck into: 'BDE', 'AM', or 'Other'
    - Concatenating with Pclass
    - Binning combinations with < 10 occurrences into 'Rare'

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Deck_bin' to both dataframes)
    """
    def bin_deck(deck):
        if deck in ['B', 'D', 'E']:
            return 'BDE'
        elif deck in ['A', 'M']:
            return 'AM'
        else:
            return 'Other'

    def combine(pclass, deck):
        return f"{pclass}_{bin_deck(deck)}"

    # Build composite keys from training set
    train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Deck']), axis=1)
    value_counts = train_keys.value_counts()
    common_groups = value_counts[value_counts >= 10].index

    def assign_or_rare(pclass, deck):
        key = f"{pclass}_{bin_deck(deck)}"
        return key if key in common_groups else "Rare"

    for df in [train_df, test_df]:
        df['Pclass_Deck_bin'] = df.apply(
            lambda row: assign_or_rare(row['Pclass'], row['Deck']), axis=1
        ).astype(str)

    print("Created 'Pclass_Deck_bin' in train_df and test_df (groups < 10 binned to 'Rare').")

create_feature_Pclass_Deck_bin(prepared_train_df, prepared_test_df)
Created 'Pclass_Deck_bin' in train_df and test_df (groups < 10 binned to 'Rare').
In [278]:
prepared_train_df['Pclass_Deck_bin'].value_counts()
Out[278]:
Pclass_Deck_bin
3_AM       479
2_AM       168
1_BDE      101
1_Other     60
1_AM        55
Rare        28
Name: count, dtype: int64
In [279]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Deck_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_Deck_bin 0.013757 0.006021 0.021299 0.005887 0.021299 0.007718 0.017449 0.016297 0.006021
Pclass x Cabin_Location_s¶

Creating (n < 10)-binned Pclass_Cabin_Location_s reduced distribution shift to negligible levels (KL = 0.017970).

In [282]:
def create_feature_Pclass_Cabin_Location_s(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Cabin_Location_s' by combining:
    - Pclass (1, 2, 3)
    - Cabin_Location_s (e.g., 'port', 'starboard', 'unknown')

    Groups with fewer than 10 occurrences in the training set are binned into 'Rare'.

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Cabin_Location_s' column to both dataframes)
    """
    def combine(pclass, cabin_location):
        return f"{pclass}_{cabin_location}"

    # Compute composite keys in train
    train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Cabin_Location_s']), axis=1)
    value_counts = train_keys.value_counts()
    common_groups = value_counts[value_counts >= 10].index

    def assign_or_rare(pclass, cabin_location):
        key = f"{pclass}_{cabin_location}"
        return key if key in common_groups else "Rare"

    for df in [train_df, test_df]:
        df['Pclass_Cabin_Location_s'] = df.apply(
            lambda row: assign_or_rare(row['Pclass'], row['Cabin_Location_s']), axis=1
        ).astype(str)

    print("Created 'Pclass_Cabin_Location_s' in train_df and test_df (groups < 10 binned to 'Rare').")

create_feature_Pclass_Cabin_Location_s(prepared_train_df, prepared_test_df)
Created 'Pclass_Cabin_Location_s' in train_df and test_df (groups < 10 binned to 'Rare').
In [283]:
prepared_train_df['Pclass_Cabin_Location_s'].value_counts()
Out[283]:
Pclass_Cabin_Location_s
3_no_cabin_info    479
2_no_cabin_info    168
1_port              96
1_starboard         77
1_no_cabin_info     40
Rare                31
Name: count, dtype: int64
In [284]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Cabin_Location_s'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_Cabin_Location_s 0.017970 0.006337 0.035650 0.011163 0.026185 0.006337 0.035650 0.012567 0.009113
Pclass x Normalized Title¶

Created Pclass_Title_normalized - exhibited mild CF distribution shift (KL = 0.025850), driven by the low-sample 12_Master category. Keeping it for now to potentially benefit from that group; will ablation-test during the Model Development phase.

In [287]:
def create_feature_Pclass_Title_normalized(train_df, test_df):
    """
    Creates a composite feature 'Pclass_Title_normalized' by:
    - Normalizing Title based on Sex, SibSp, and Age
    - Concatenating with Pclass
    - Merging '1_Master' and '2_Master' into '12_Master'

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Pclass_Title_normalized' column to both dataframes)
    """
    def normalize_title(row):
        title = row['Title']
        sex = row['Sex']
        sibsp = row['SibSp']
        age = row['Age']

        if title in ['Mr', 'Mrs', 'Miss', 'Master']:
            return title
        if sex == 'male':
            if pd.notna(age) and age < 14:
                return 'Master'
            return 'Mr'
        else:
            return 'Mrs' if sibsp > 0 else 'Miss'

    def assign_pclass_title(row):
        pclass = row['Pclass']
        title = row['Normalized_Title']
        if title == 'Master' and pclass in [1, 2]:
            return '12_Master'
        return f"{pclass}_{title}"

    for df in [train_df, test_df]:
        df['Normalized_Title'] = df.apply(normalize_title, axis=1)
        df['Pclass_Title_normalized'] = df.apply(assign_pclass_title, axis=1).astype(str)

    print("Created 'Pclass_Title_normalized' in train_df and test_df with '12_Master' merged.")

create_feature_Pclass_Title_normalized(prepared_train_df, prepared_test_df)
Created 'Pclass_Title_normalized' in train_df and test_df with '12_Master' merged.
In [288]:
prepared_train_df['Pclass_Title_normalized'].value_counts()
Out[288]:
Pclass_Title_normalized
3_Mr         318
1_Mr         119
2_Mr          99
3_Miss        81
3_Mrs         63
1_Miss        49
1_Mrs         45
2_Miss        44
2_Mrs         32
3_Master      29
12_Master     12
Name: count, dtype: int64
In [289]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Title_normalized'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Pclass_Title_normalized 0.025850 0.010231 0.039694 0.011595 0.010231 0.039694 0.038210 0.024290 0.016827
Deck_bin¶
  • Created Deck_bin - exhibited negligible CF distribution shift (KL = 0.005735).
  • Decks B/D/E and A/M were each binned together ('BDE', 'AM') due to similar survival rates observed in EDA.
In [292]:
def create_feature_Deck_bin(train_df, test_df):
    """
    Creates a binned feature 'Deck_bin' from the 'Deck' column:
    - 'BDE' if Deck is B, D, or E
    - 'AM' if Deck is A or M
    - 'Other' for all other values (including NaN)

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Deck_bin' to both dataframes)
    """
    def bin_deck(deck):
        if deck in ['B', 'D', 'E']:
            return 'BDE'
        elif deck in ['A', 'M']:
            return 'AM'
        else:
            return 'Other'

    for df in [train_df, test_df]:
        df['Deck_bin'] = df['Deck'].apply(bin_deck).astype(str)

    print("Created 'Deck_bin' in train_df and test_df.")

create_feature_Deck_bin(prepared_train_df, prepared_test_df)
Created 'Deck_bin' in train_df and test_df.
In [293]:
prepared_train_df['Deck_bin'].value_counts()
Out[293]:
Deck_bin
AM       702
BDE      112
Other     77
Name: count, dtype: int64
In [294]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Deck_bin'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Deck_bin 0.005735 0.000509 0.021163 0.007751 0.001993 0.002214 0.000509 0.021163 0.002795
In [295]:
survival_df = (
    prepared_train_df
    .groupby("Deck_bin", observed=True)
    .agg(Survival_Rate=('Survived', 'mean'), 
         Count=('Survived', 'size')
        )
    .reset_index()
    .sort_values(by="Deck_bin", ascending=True)
)
survival_df
Out[295]:
Deck_bin Survival_Rate Count
0 AM 0.303419 702
1 BDE 0.750000 112
2 Other 0.584416 77
Title_normalized¶
  • Created Title_normalized - exhibited negligible CF distribution shift (KL = 0.011534).
  • Rare titles are merged into Mr/Mrs/Miss/Master groups based on their Sex, SibSp, and Age.
In [298]:
def create_feature_Title_normalized(train_df, test_df):
    """
    Creates a normalized title feature 'Title_normalized' using:
    - Original Title
    - Sex
    - SibSp
    - Age

    Logic:
    - Keep 'Mr', 'Mrs', 'Miss', and 'Master' as-is
    - For all other titles:
        - If male and Age < 14 → 'Master'
        - If male → 'Mr'
        - If female and SibSp > 0 → 'Mrs'
        - If female and SibSp == 0 → 'Miss'

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        None (adds 'Title_normalized' to both dataframes)
    """
    def normalize_title(row):
        title = row['Title']
        sex = row['Sex']
        sibsp = row['SibSp']
        age = row['Age']

        if title in ['Mr', 'Mrs', 'Miss', 'Master']:
            return title
        if sex == 'male':
            if pd.notna(age) and age < 14:
                return 'Master'
            return 'Mr'
        else:
            return 'Mrs' if sibsp > 0 else 'Miss'

    for df in [train_df, test_df]:
        df['Title_normalized'] = df.apply(normalize_title, axis=1).astype(str)

    print("Created 'Title_normalized' in train_df and test_df.")

create_feature_Title_normalized(prepared_train_df, prepared_test_df)
Created 'Title_normalized' in train_df and test_df.
In [299]:
prepared_train_df['Title_normalized'].value_counts()
Out[299]:
Title_normalized
Mr        536
Miss      174
Mrs       140
Master     41
Name: count, dtype: int64
In [300]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Title_normalized'])
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
0 Title_normalized 0.011534 0.001920 0.025101 0.008669 0.001920 0.025101 0.018127 0.006011 0.006509

Pclass_Sex One-Hot Encodings¶

In [302]:
def create_Pclass_Sex_one_hot_encodings(train_df, test_df):
    """
    Creates one-hot encoded columns from the composite 'Pclass_Sex' feature.

    Args:
        train_df (pd.DataFrame): Training set.
        test_df (pd.DataFrame): Test set.

    Returns:
        pd.Index: Names of the created one-hot columns.
    """
    for df in [train_df, test_df]:
        dummies = pd.get_dummies(df['Pclass_Sex'], prefix='Pclass_Sex')
        df[dummies.columns] = dummies
        print(f"{len(dummies.columns)} one-hot encodings created for Pclass x Sex: {list(dummies.columns)}")
    return dummies.columns

pclass_sex_oh_cols = create_Pclass_Sex_one_hot_encodings(prepared_train_df, prepared_test_df)
6 one-hot encodings created for Pclass x Sex: ['Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female', 'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male']
6 one-hot encodings created for Pclass x Sex: ['Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female', 'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male']

Negligible distribution shift for all created Pclass x Sex one-hot encodings (all KL < 0.02)

In [304]:
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, pclass_sex_oh_cols)
display(styled)
  Feature Avg_KL_Divergence Min_KL_Divergence Max_KL_Divergence Std_KL_Divergence Fold_1_KL Fold_2_KL Fold_3_KL Fold_4_KL Fold_5_KL
3 Pclass_Sex_2_male 0.003808 0.000597 0.013918 0.005155 0.000688 0.000597 0.013918 0.000597 0.003240
0 Pclass_Sex_1_female 0.002241 0.000013 0.004184 0.001328 0.002324 0.002189 0.004184 0.002497 0.000013
4 Pclass_Sex_3_female 0.002012 0.000001 0.006726 0.002623 0.000001 0.000271 0.006726 0.000010 0.003055
1 Pclass_Sex_1_male 0.001723 0.000029 0.004394 0.001594 0.001223 0.002564 0.000029 0.004394 0.000404
2 Pclass_Sex_2_female 0.001264 0.000205 0.003621 0.001215 0.000527 0.003621 0.000205 0.000982 0.000982
5 Pclass_Sex_3_male 0.000355 0.000048 0.000762 0.000326 0.000762 0.000738 0.000048 0.000048 0.000182

Survival Association Tests¶

  • Chi-squared tests are run against selected global and Pclass x Sex subgrouped features to determine which have a statistically significant association with Survived (p < 0.05).
  • Features are then sorted by descending Cramer's V value (strength of association) to prioritize testing during Model Development.
In [307]:
def chi2_test_features_against_survival_with_cramers_v(df, feature_list, target_col='Survived', alpha=0.05):
    """
    Perform chi-squared tests between a list of categorical features and the target column (Survived).
    Also calculates Cramér's V to indicate the strength of the association.

    Args:
        df (pd.DataFrame): DataFrame containing the data.
        feature_list (list): List of feature column names to test.
        target_col (str): The target column to test association with.
        alpha (float): Significance level for determining statistical significance.

    Returns:
        pd.DataFrame: DataFrame summarizing chi-squared test results with Cramér's V, sorted by descending Cramér's V.
    """
    results = []

    for feature in feature_list:
        contingency = pd.crosstab(df[feature], df[target_col])
        n = contingency.sum().sum()

        if contingency.shape[0] < 2 or contingency.shape[1] < 2:
            results.append({
                'Feature': feature,
                'Chi2 Statistic': np.nan,
                'p-value': np.nan,
                'Cramer\'s V': np.nan,
                'Significant': False
            })
            continue

        chi2, p, dof, expected = chi2_contingency(contingency)
        k = min(contingency.shape)
        cramers_v = np.sqrt(chi2 / (n * (k - 1))) if k > 1 else np.nan

        results.append({
            'Feature': feature,
            'Chi2 Statistic': chi2,
            'p-value': p,
            'Cramer\'s V': cramers_v,
            'Significant': p < alpha
        })

    results_df = pd.DataFrame(results).sort_values(by="Cramer\'s V", ascending=False)

    # Style output to highlight statistically significant rows in green
    def highlight_significant(row):
        color = 'background-color: lightgreen' if row['Significant'] else ''
        return [color] * len(row)

    styled = results_df.style.apply(highlight_significant, axis=1)
    display(styled)

    return results_df
Global Feature Survival Association Tests¶
In [309]:
global_features_to_eval = [
    'Pclass_HasCabin',
    'Sex_HasCabin',
    'Embarked_HasCabin',
    'Parch_SibSp_bin',
    'HasCabin_Parch_SibSp_bin',
    'Pclass_Parch_SibSp_bin',
    'Sex_Parch_SibSp_bin',
    'Pclass_Embarked',
    'Sex_Embarked',
    'Pclass_Deck_bin',
    'Pclass_Cabin_Location_s',
    'Pclass_Title_normalized',
    'Deck_bin',
    'Title_normalized',
    'Pclass_Sex_1_female',
    'Pclass_Sex_1_male',
    'Pclass_Sex_2_female',
    'Pclass_Sex_2_male',
    'Pclass_Sex_3_female',
    'Pclass_Sex_3_male',
    'Pclass_Sex'
]

# Run chi-squared + Cramér's V analysis
results_df = chi2_test_features_against_survival_with_cramers_v(prepared_train_df, global_features_to_eval)
  Feature Chi2 Statistic p-value Cramer's V Significant
11 Pclass_Title_normalized 400.105514 0.000000 0.670114 True
20 Pclass_Sex 350.675308 0.000000 0.627356 True
1 Sex_HasCabin 315.679272 0.000000 0.595229 True
13 Title_normalized 292.273628 0.000000 0.572738 True
8 Sex_Embarked 278.911706 0.000000 0.559493 True
6 Sex_Parch_SibSp_bin 271.402109 0.000000 0.551909 True
14 Pclass_Sex_1_female 148.919875 0.000000 0.408825 True
19 Pclass_Sex_3_male 146.550069 0.000000 0.405559 True
5 Pclass_Parch_SibSp_bin 131.446775 0.000000 0.384093 True
9 Pclass_Deck_bin 123.775185 0.000000 0.372716 True
4 HasCabin_Parch_SibSp_bin 121.671640 0.000000 0.369535 True
7 Pclass_Embarked 120.638493 0.000000 0.367963 True
10 Pclass_Cabin_Location_s 120.366297 0.000000 0.367548 True
0 Pclass_HasCabin 117.021729 0.000000 0.362405 True
2 Embarked_HasCabin 103.202699 0.000000 0.340335 True
16 Pclass_Sex_2_female 98.919730 0.000000 0.333198 True
12 Deck_bin 95.786717 0.000000 0.327879 True
3 Parch_SibSp_bin 77.587742 0.000000 0.295092 True
17 Pclass_Sex_2_male 25.563777 0.000000 0.169384 True
18 Pclass_Sex_3_female 9.222372 0.002391 0.101738 True
15 Pclass_Sex_1_male 0.070848 0.790106 0.008917 False
Pclass x Sex Subgroup Feature Survival Association Tests¶
In [311]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from IPython.display import display

def cramers_v_stat(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1))) if k > 1 else np.nan

def chi2_test_features_by_pclass_sex(df, feature_list, target_col='Survived', alpha=0.05):
    """
    Perform chi-squared tests and Cramér's V for each feature within each Pclass x Sex subgroup.

    Args:
        df (pd.DataFrame): DataFrame containing features and target.
        feature_list (list): List of categorical feature column names to evaluate.
        target_col (str): Name of the binary target column. Default is 'Survived'.
        alpha (float): Significance level. Default is 0.05.

    Returns:
        pd.DataFrame: Styled DataFrame sorted by Pclass, Sex, and descending Cramér's V.
    """
    results = []

    for pclass in sorted(df['Pclass'].dropna().unique()):
        for sex in sorted(df['Sex'].dropna().unique()):
            subgroup_df = df[(df['Pclass'] == pclass) & (df['Sex'] == sex)]

            for feature in feature_list:
                contingency = pd.crosstab(subgroup_df[feature], subgroup_df[target_col])
                n = contingency.sum().sum()

                if contingency.shape[0] < 2 or contingency.shape[1] < 2:
                    results.append({
                        'Feature': feature,
                        'Pclass': pclass,
                        'Sex': sex,
                        'Chi2 Statistic': np.nan,
                        'p-value': np.nan,
                        'Cramer\'s V': np.nan,
                        'Significant': False
                    })
                    continue

                chi2, p, dof, _ = chi2_contingency(contingency)
                k = min(contingency.shape)
                v = cramers_v_stat(chi2, n, k)

                results.append({
                    'Feature': feature,
                    'Pclass': pclass,
                    'Sex': sex,
                    'Chi2 Statistic': chi2,
                    'p-value': p,
                    'Cramer\'s V': v,
                    'Significant': p < alpha
                })

    results_df = pd.DataFrame(results).sort_values(
        by=["Pclass", "Sex", "Cramer\'s V"],
        ascending=[True, True, False]
    )

    def highlight_significant(row):
        return ['background-color: lightgreen' if row['Significant'] else '' for _ in row]

    styled = results_df.style.apply(highlight_significant, axis=1)
    display(styled)

    return results_df
In [312]:
pclass_sex_subgroup_features_to_eval = [
    'Parch_SibSp_bin',
    'Embarked',
    'HasCabin',
    'Cabin_Location_s',
    'Deck_bin',
    'Title_normalized',
    'Age_Group',
    'FPP_log_bin'
]

results_df = chi2_test_features_by_pclass_sex(prepared_train_df, pclass_sex_subgroup_features_to_eval)
  Feature Pclass Sex Chi2 Statistic p-value Cramer's V Significant
6 Age_Group 1 female 31.269190 0.000003 0.576759 True
0 Parch_SibSp_bin 1 female 30.219349 0.000004 0.566994 True
4 Deck_bin 1 female 7.689866 0.021388 0.286019 True
3 Cabin_Location_s 1 female 1.170785 0.760020 0.111603 False
1 Embarked 1 female 0.243108 0.885543 0.050855 False
7 FPP_log_bin 1 female 0.128096 0.720414 0.036915 False
5 Title_normalized 1 female 0.005622 0.940233 0.007733 False
2 HasCabin 1 female 0.000000 1.000000 0.000000 False
14 Age_Group 1 male 10.257159 0.036312 0.289957 True
8 Parch_SibSp_bin 1 male 7.099910 0.130702 0.241238 False
15 FPP_log_bin 1 male 4.541498 0.103235 0.192939 False
11 Cabin_Location_s 1 male 4.361506 0.224981 0.189077 False
12 Deck_bin 1 male 2.851519 0.240326 0.152883 False
13 Title_normalized 1 male 2.850271 0.091359 0.152849 False
10 HasCabin 1 male 2.444523 0.117936 0.141552 False
9 Embarked 1 male 0.887638 0.641582 0.085298 False
23 FPP_log_bin 2 female 3.725582 0.444416 0.221406 False
16 Parch_SibSp_bin 2 female 1.231122 0.872948 0.127275 False
22 Age_Group 2 female 1.228990 0.746060 0.127165 False
20 Deck_bin 2 female 0.987013 0.610482 0.113961 False
17 Embarked 2 female 0.875053 0.645631 0.107303 False
19 Cabin_Location_s 2 female 0.659575 0.882668 0.093159 False
18 HasCabin 2 female 0.000000 1.000000 0.000000 False
21 Title_normalized 2 female 0.000000 1.000000 0.000000 False
30 Age_Group 2 male 49.511789 0.000000 0.677084 True
29 Title_normalized 2 male 45.854146 0.000000 0.651595 True
27 Cabin_Location_s 2 male 16.443728 0.000269 0.390201 True
24 Parch_SibSp_bin 2 male 15.727944 0.001289 0.381614 True
28 Deck_bin 2 male 13.050838 0.001466 0.347622 True
26 HasCabin 2 male 8.689608 0.003200 0.283654 True
31 FPP_log_bin 2 male 3.969421 0.264785 0.191713 False
25 Embarked 2 male 0.329199 0.848234 0.055210 False
32 Parch_SibSp_bin 3 female 22.482968 0.000161 0.395135 True
33 Embarked 3 female 14.448617 0.000729 0.316761 True
39 FPP_log_bin 3 female 7.657619 0.104956 0.230603 False
38 Age_Group 3 female 7.165236 0.127410 0.223066 False
37 Title_normalized 3 female 5.530864 0.018684 0.195982 True
35 Cabin_Location_s 3 female 2.028986 0.362586 0.118702 False
36 Deck_bin 3 female 1.228986 0.540915 0.092383 False
34 HasCabin 3 female 0.173913 0.676657 0.034752 False
45 Title_normalized 3 male 13.878592 0.000195 0.199990 True
44 Deck_bin 3 male 13.427928 0.001214 0.196716 True
46 Age_Group 3 male 12.709396 0.012787 0.191381 True
40 Parch_SibSp_bin 3 male 11.409147 0.022331 0.181327 True
47 FPP_log_bin 3 male 8.616516 0.071433 0.157580 False
41 Embarked 3 male 4.719183 0.094459 0.116619 False
43 Cabin_Location_s 3 male 2.753371 0.252414 0.089077 False
42 HasCabin 3 male 0.684198 0.408145 0.044404 False
Survival Association Test Strategy and Results¶

Strategy:

  • Chi-squared tests were used to confirm the statistical significance of each feature's association with Survived (p < 0.05).
  • Cramér's V was calculated for each feature to sort features in descending order of association strength.
  • Features were split into two categories:
    • "Global": Features informing rules shared across Pclass x Sex subgroups.
    • "Pclass x Sex Subgroup": Features informing rules constrained to one Pclass x Sex subgroup (e.g. P1 Males, P3 Females).

Test Results:

  • The following features will be used as the bases for smoothed survival rate features for modeling (sorted by descending Cramer's V):
    • Global Features:
      • Pclass_Title_normalized
      • Pclass_Sex
      • Sex_HasCabin
      • Title_normalized
      • Sex_Embarked
      • Sex_Parch_SibSp_bin
      • Pclass_Sex_1_female
      • Pclass_Sex_3_male
      • Pclass_Parch_SibSp_bin
      • Pclass_Deck_bin
      • HasCabin_Parch_SibSp_bin
      • Pclass_Embarked
      • Pclass_Cabin_Location_s
      • Pclass_HasCabin
      • Embarked_HasCabin
      • Pclass_Sex_2_female
      • Deck_bin
      • Parch_SibSp_bin
      • Pclass_Sex_2_male
      • Pclass_Sex_3_female
    • Pclass x Sex Subgroup Features:
      • Pclass 1, Sex female
        • Age_Group
        • Parch_SibSp_bin
        • Deck_bin
      • Pclass 1, Sex male
        • Age_Group
      • Pclass 2, Sex male
        • Age_Group
        • Title_normalized
        • Cabin_Location_s
        • Parch_SibSp_bin
        • Deck_bin
        • HasCabin
      • Pclass 3, Sex female
        • Parch_SibSp_bin
        • Embarked
        • Age_Group
        • Title_normalized
      • Pclass 3, Sex male
        • Age_Group
        • Title_normalized
        • Deck_bin
        • Parch_SibSp_bin

Smoothed Survival Rate Feature Engineering¶

  • Smoothed Survival Rate Features ("Smoothed Features") are target-encoded features that use the survival rates observed for particular subgroups in the training set to inform survival predictions for the same subgroups in the submission data set.
  • To mitigate data leakage, the following actions are performed:
    • No information from the test set is ever used in calculating group-level or global survival statistics. All smoothed values for the test set are computed exclusively using the full training set.
    • When preparing training data set smoothed features for cross-validation, the full training data set is never used for any calculations of group or global means.
      • The means can only be calculated using training fold data
      • The calculated smoothed rates are only applied to the validation fold data
    • When preparing submission test data set smoothed features, only the full training data set is used.
  • Smoothed rates are calculated using a Bayesian adjustment formula to balance subgroup means with the overall global mean, taking into account low-sample subgroups.
    • For training data set cross-validation preparation:
      • grouped['smoothed'] = (grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) / (grouped['group_count'] + prior)
    • For submission test data set preparation:
      • grouped_full['smoothed'] = (grouped_full['group_mean'] * grouped_full['group_count'] + prior * full_global_mean) / (grouped_full['group_count'] + prior)
  • Subgroups with fewer than 10 samples are excluded from the smoothed rate calculation.
  • A relevance mask is applied to each smoothed feature to zero-out its values for passengers that do not match the feature's pclass_sex subgroup.
    • This ensures that the model only learns from smoothed features that are relevant to the passenger’s actual subgroup.
    • Example: A smoothed feature for Pclass=1, Sex=Female is set to 0 for a male passenger in Pclass=3.
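As a concrete illustration of the Bayesian adjustment formula above (a minimal sketch; `smoothed_rate` is a hypothetical helper, not the notebook's function):

```python
def smoothed_rate(group_mean, group_count, global_mean, prior=10):
    """Blend a subgroup's observed mean with the global mean, weighted by a prior pseudo-count."""
    return (group_mean * group_count + prior * global_mean) / (group_count + prior)

# A subgroup of 20 passengers with a 0.90 observed survival rate, global mean 0.38:
# (0.90 * 20 + 10 * 0.38) / (20 + 10) = 21.8 / 30 ≈ 0.727
print(smoothed_rate(0.90, 20, 0.38))
```

The prior acts as pseudo-observations at the global mean, so small subgroups are pulled strongly toward it while large subgroups mostly keep their observed rate.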
In [317]:
# For use in creating leakage-free standalone smoothed rate features (e.g. Title_bin_smoothed)
def generate_global_smoothed_feature(train_df, target_col, group_col,
                                     test_df=None, prior=10, feature_name=None,
                                     n_splits=5, random_state=42):
    """
    Create a globally smoothed target encoding for a categorical feature using CV-based out-of-fold encoding.
    Prevents leakage by computing smoothed values within each fold.
    Groups with n < 10 are excluded from the smoothed map and default to the global mean.

    Parameters:
        train_df (pd.DataFrame): Training set.
        target_col (str): Target variable (e.g. 'Survived').
        group_col (str): Categorical feature to encode (e.g. 'Title_bin').
        test_df (pd.DataFrame or None): Optional test set to encode.
        prior (int): Smoothing strength for Bayesian mean.
        feature_name (str or None): Optional name for the new feature.
        n_splits (int): Number of CV folds for OOF encoding.
        random_state (int): Seed for reproducibility.

    Returns:
        str: Name of the generated smoothed feature.
    """
    if feature_name is None:
        feature_name = f'global_{group_col}_smoothed'

    oof_feature = pd.Series(0.0, index=train_df.index)
    global_mean = train_df[target_col].mean()

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    for train_idx, val_idx in skf.split(train_df, train_df[target_col]):
        fold_train = train_df.iloc[train_idx]
        fold_val = train_df.iloc[val_idx]

        fold_global_mean = fold_train[target_col].mean()

        grouped = (
            fold_train.groupby(group_col, observed=True)[target_col]
            .agg(['mean', 'count'])
            .rename(columns={'mean': 'group_mean', 'count': 'group_count'})
        )

        # Filter to exclude low-sample groups
        grouped = grouped[grouped['group_count'] >= 10]

        grouped['smoothed'] = (
            (grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) /
            (grouped['group_count'] + prior)
        )

        smoothed_map = grouped['smoothed'].to_dict()
        val_keys = fold_val[group_col]
        smoothed_vals = val_keys.map(smoothed_map).fillna(fold_global_mean)
        oof_feature.iloc[val_idx] = smoothed_vals.values  # assign positionally to avoid index alignment surprises

    train_df[feature_name] = oof_feature
    print(f"✅ Added feature '{feature_name}' to train_df.")

    if test_df is not None:
        grouped_full = (
            train_df.groupby(group_col, observed=True)[target_col]
            .agg(['mean', 'count'])
            .rename(columns={'mean': 'group_mean', 'count': 'group_count'})
        )

        grouped_full = grouped_full[grouped_full['group_count'] >= 10]

        grouped_full['smoothed'] = (
            (grouped_full['group_mean'] * grouped_full['group_count'] + prior * global_mean) /
            (grouped_full['group_count'] + prior)
        )

        smoothed_map_test = grouped_full['smoothed'].to_dict()
        test_keys = test_df[group_col]
        smoothed_vals = test_keys.map(smoothed_map_test).fillna(global_mean)
        test_df[feature_name] = smoothed_vals

        print(f"✅ Added feature '{feature_name}' to test_df.")

    return feature_name




# For use in creating leakage-free Pclass x Sex smoothed rate features (e.g. P1_Male_Title_bin_smoothed)
def generate_subgroup_smoothed_feature(train_df, target_col, pclass_val, sex_val, group_col=None,
                                       test_df=None, feature_name=None, prior=10, n_splits=5, random_state=42):
    """
    Adds an out-of-fold smoothed target encoding feature to train_df (and optionally test_df),
    for a specific Pclass × Sex subgroup, optionally grouped by another column.

    If group_col is None, a single smoothed rate is applied to the subgroup.

    Parameters:
        train_df (pd.DataFrame): Training DataFrame.
        target_col (str): Target variable (e.g., 'Survived').
        pclass_val (int): Pclass value (1, 2, or 3).
        sex_val (str): 'male' or 'female'.
        group_col (str or None): If given, compute rates per group_col. Otherwise, single subgroup rate.
        test_df (pd.DataFrame, optional): Optional test DataFrame.
        feature_name (str, optional): Feature name to assign. Auto-generated if None.
        prior (float): Smoothing strength.
        n_splits (int): StratifiedKFold folds.
        random_state (int): Seed.

    Returns:
        str: Name of the feature added to train_df (and test_df if given).
    """
    group_label = group_col if group_col else "overall"
    if feature_name is None:
        feature_name = f'P{pclass_val}_{sex_val.capitalize()}_{group_label}_smoothed'

    mask_train = (train_df['Pclass'] == pclass_val) & (train_df['Sex'] == sex_val)
    subgroup_df = train_df[mask_train].copy()

    oof_feature = pd.Series(0.0, index=train_df.index)

    if group_col is None:
        # Handle subgroup-wide smoothing without further grouping
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        for train_idx, val_idx in skf.split(subgroup_df, subgroup_df[target_col]):
            fold_train = subgroup_df.iloc[train_idx]
            fold_val = subgroup_df.iloc[val_idx]

            fold_global_mean = train_df[target_col].mean()  # full-training global mean, matching the test-time prior
            group_mean = fold_train[target_col].mean()
            group_count = len(fold_train)

            smoothed_value = (
                (group_mean * group_count + prior * fold_global_mean) /
                (group_count + prior)
            )

            oof_feature.loc[fold_val.index] = smoothed_value

    else:
        # Normal group_col-specific smoothing
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        for train_idx, val_idx in skf.split(subgroup_df, subgroup_df[target_col]):
            fold_train = subgroup_df.iloc[train_idx]
            fold_val = subgroup_df.iloc[val_idx]

            fold_global_mean = fold_train[target_col].mean()

            grouped = (
                fold_train.groupby(group_col, observed=True)[target_col]
                .agg(['mean', 'count'])
                .rename(columns={'mean': 'group_mean', 'count': 'group_count'})
            )
            grouped = grouped[grouped['group_count'] >= 10].copy()

            grouped['smoothed'] = (
                (grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) /
                (grouped['group_count'] + prior)
            )

            smoothed_map = grouped['smoothed'].to_dict()
            val_keys = fold_val[group_col]
            oof_feature.loc[fold_val.index] = val_keys.map(smoothed_map).fillna(0.0)

    train_df[feature_name] = oof_feature
    print(f"✅ Added feature '{feature_name}' to train_df (Pclass={pclass_val}, Sex={sex_val})")

    if test_df is not None:
        mask_test = (test_df['Pclass'] == pclass_val) & (test_df['Sex'] == sex_val)
        test_df[feature_name] = 0.0  # Default value for all

        if group_col is None:
            global_mean = train_df[target_col].mean()
            subgroup_mean = subgroup_df[target_col].mean()
            subgroup_count = len(subgroup_df)

            smoothed_value = (
                (subgroup_mean * subgroup_count + prior * global_mean) /
                (subgroup_count + prior)
            )

            test_df.loc[mask_test, feature_name] = smoothed_value

        else:
            # The prior here is the subgroup mean, matching the fold-level calculation above
            subgroup_prior_mean = subgroup_df[target_col].mean()
            grouped = (
                subgroup_df.groupby(group_col, observed=True)[target_col]
                .agg(['mean', 'count'])
                .rename(columns={'mean': 'group_mean', 'count': 'group_count'})
            )
            grouped = grouped[grouped['group_count'] >= 10].copy()

            grouped['smoothed'] = (
                (grouped['group_mean'] * grouped['group_count'] + prior * subgroup_prior_mean) /
                (grouped['group_count'] + prior)
            )

            smoothed_map_test = grouped['smoothed'].to_dict()
            test_keys = test_df.loc[mask_test, group_col]
            test_df.loc[mask_test, feature_name] = test_keys.map(smoothed_map_test).fillna(0.0)

        print(f"✅ Added feature '{feature_name}' to test_df (Pclass={pclass_val}, Sex={sex_val})")

    return feature_name
Generate Global Smoothed Features¶
In [319]:
global_feature_list = [
    "Pclass_Title_normalized",
    "Pclass_Sex",
    "Sex_HasCabin",
    "Title_normalized",
    "Sex_Embarked",
    "Sex_Parch_SibSp_bin",
    "Pclass_Parch_SibSp_bin",
    "Pclass_Deck_bin",
    "HasCabin_Parch_SibSp_bin",
    "Pclass_Embarked",
    "Pclass_Cabin_Location_s",
    "Pclass_HasCabin",
    "Embarked_HasCabin",
    "Deck_bin",
    "Parch_SibSp_bin",

    # These will be accounted for in the next section
    #"Pclass_Sex_2_female",
    #"Pclass_Sex_1_female",
    #"Pclass_Sex_3_male",
    #"Pclass_Sex_2_male",
    #"Pclass_Sex_3_female"
]
global_feature_cols = []
for col in global_feature_list:
    global_feature_col = generate_global_smoothed_feature(
        train_df=prepared_train_df,
        target_col='Survived',
        group_col=col,
        test_df=prepared_test_df  # optional; omit if not available
    )
    global_feature_cols.append(global_feature_col)
✅ Added feature 'global_Pclass_Title_normalized_smoothed' to train_df.
✅ Added feature 'global_Pclass_Title_normalized_smoothed' to test_df.
✅ Added feature 'global_Pclass_Sex_smoothed' to train_df.
✅ Added feature 'global_Pclass_Sex_smoothed' to test_df.
✅ Added feature 'global_Sex_HasCabin_smoothed' to train_df.
✅ Added feature 'global_Sex_HasCabin_smoothed' to test_df.
✅ Added feature 'global_Title_normalized_smoothed' to train_df.
✅ Added feature 'global_Title_normalized_smoothed' to test_df.
✅ Added feature 'global_Sex_Embarked_smoothed' to train_df.
✅ Added feature 'global_Sex_Embarked_smoothed' to test_df.
✅ Added feature 'global_Sex_Parch_SibSp_bin_smoothed' to train_df.
✅ Added feature 'global_Sex_Parch_SibSp_bin_smoothed' to test_df.
✅ Added feature 'global_Pclass_Parch_SibSp_bin_smoothed' to train_df.
✅ Added feature 'global_Pclass_Parch_SibSp_bin_smoothed' to test_df.
✅ Added feature 'global_Pclass_Deck_bin_smoothed' to train_df.
✅ Added feature 'global_Pclass_Deck_bin_smoothed' to test_df.
✅ Added feature 'global_HasCabin_Parch_SibSp_bin_smoothed' to train_df.
✅ Added feature 'global_HasCabin_Parch_SibSp_bin_smoothed' to test_df.
✅ Added feature 'global_Pclass_Embarked_smoothed' to train_df.
✅ Added feature 'global_Pclass_Embarked_smoothed' to test_df.
✅ Added feature 'global_Pclass_Cabin_Location_s_smoothed' to train_df.
✅ Added feature 'global_Pclass_Cabin_Location_s_smoothed' to test_df.
✅ Added feature 'global_Pclass_HasCabin_smoothed' to train_df.
✅ Added feature 'global_Pclass_HasCabin_smoothed' to test_df.
✅ Added feature 'global_Embarked_HasCabin_smoothed' to train_df.
✅ Added feature 'global_Embarked_HasCabin_smoothed' to test_df.
✅ Added feature 'global_Deck_bin_smoothed' to train_df.
✅ Added feature 'global_Deck_bin_smoothed' to test_df.
✅ Added feature 'global_Parch_SibSp_bin_smoothed' to train_df.
✅ Added feature 'global_Parch_SibSp_bin_smoothed' to test_df.
In [320]:
# Construct list of all smoothed feature dicts
smoothed_features_to_create = [
    { 'pclass': 1, 'sex': 'male', 'group_col': None },
    { 'pclass': 2, 'sex': 'male', 'group_col': None },
    { 'pclass': 3, 'sex': 'male', 'group_col': None },  
    { 'pclass': 1, 'sex': 'female', 'group_col': None },
    { 'pclass': 2, 'sex': 'female', 'group_col': None },
    { 'pclass': 3, 'sex': 'female', 'group_col': None },  
    
    { 'pclass': 1, 'sex': 'female', 'group_col': 'Age_Group' },
    { 'pclass': 1, 'sex': 'female', 'group_col': 'Parch_SibSp_bin' },
    { 'pclass': 1, 'sex': 'female', 'group_col': 'Deck_bin' },

    { 'pclass': 1, 'sex': 'male', 'group_col': 'Age_Group' },

    { 'pclass': 2, 'sex': 'male', 'group_col': 'Age_Group' }, 
    { 'pclass': 2, 'sex': 'male', 'group_col': 'Title_normalized' }, 
    { 'pclass': 2, 'sex': 'male', 'group_col': 'Cabin_Location_s' }, 
    { 'pclass': 2, 'sex': 'male', 'group_col': 'Parch_SibSp_bin' }, 
    { 'pclass': 2, 'sex': 'male', 'group_col': 'Deck_bin' }, 
    { 'pclass': 2, 'sex': 'male', 'group_col': 'HasCabin' }, 
    
    { 'pclass': 3, 'sex': 'female', 'group_col': 'Parch_SibSp_bin' },
    { 'pclass': 3, 'sex': 'female', 'group_col': 'Embarked' },
    { 'pclass': 3, 'sex': 'female', 'group_col': 'Age_Group' },
    { 'pclass': 3, 'sex': 'female', 'group_col': 'Title_normalized' },

    { 'pclass': 3, 'sex': 'male', 'group_col': 'Age_Group' },
    { 'pclass': 3, 'sex': 'male', 'group_col': 'Title_normalized' },
    { 'pclass': 3, 'sex': 'male', 'group_col': 'Deck_bin' },
    { 'pclass': 3, 'sex': 'male', 'group_col': 'Parch_SibSp_bin' },
]
smoothed_feature_cols = []
for config in smoothed_features_to_create:
    smoothed_feature_col = generate_subgroup_smoothed_feature(prepared_train_df, 'Survived', config['pclass'], config['sex'], config['group_col'],
                                       test_df=prepared_test_df)
    smoothed_feature_cols.append(smoothed_feature_col)
print(f"Created {len(smoothed_feature_cols)} smoothed features.")
✅ Added feature 'P1_Male_overall_smoothed' to train_df (Pclass=1, Sex=male)
✅ Added feature 'P1_Male_overall_smoothed' to test_df (Pclass=1, Sex=male)
✅ Added feature 'P2_Male_overall_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_overall_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P3_Male_overall_smoothed' to train_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_overall_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P1_Female_overall_smoothed' to train_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_overall_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P2_Female_overall_smoothed' to train_df (Pclass=2, Sex=female)
✅ Added feature 'P2_Female_overall_smoothed' to test_df (Pclass=2, Sex=female)
✅ Added feature 'P3_Female_overall_smoothed' to train_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_overall_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P1_Female_Age_Group_smoothed' to train_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Age_Group_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Parch_SibSp_bin_smoothed' to train_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Parch_SibSp_bin_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Deck_bin_smoothed' to train_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Deck_bin_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Male_Age_Group_smoothed' to train_df (Pclass=1, Sex=male)
✅ Added feature 'P1_Male_Age_Group_smoothed' to test_df (Pclass=1, Sex=male)
✅ Added feature 'P2_Male_Age_Group_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Age_Group_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Title_normalized_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Title_normalized_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Cabin_Location_s_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Cabin_Location_s_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Parch_SibSp_bin_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Parch_SibSp_bin_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Deck_bin_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Deck_bin_smoothed' to test_df (Pclass=2, Sex=male)
C:\Users\pault\anaconda3\Lib\site-packages\sklearn\model_selection\_split.py:805: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=5.
  warnings.warn(
C:\Users\pault\anaconda3\Lib\site-packages\sklearn\model_selection\_split.py:805: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=5.
  warnings.warn(
C:\Users\pault\anaconda3\Lib\site-packages\sklearn\model_selection\_split.py:805: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=5.
  warnings.warn(
C:\Users\pault\anaconda3\Lib\site-packages\sklearn\model_selection\_split.py:805: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=5.
  warnings.warn(
✅ Added feature 'P2_Male_HasCabin_smoothed' to train_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_HasCabin_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P3_Female_Parch_SibSp_bin_smoothed' to train_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Parch_SibSp_bin_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Embarked_smoothed' to train_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Embarked_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Age_Group_smoothed' to train_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Age_Group_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Title_normalized_smoothed' to train_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Title_normalized_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Male_Age_Group_smoothed' to train_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Age_Group_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Title_normalized_smoothed' to train_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Title_normalized_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Deck_bin_smoothed' to train_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Deck_bin_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Parch_SibSp_bin_smoothed' to train_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Parch_SibSp_bin_smoothed' to test_df (Pclass=3, Sex=male)
Created 24 smoothed features.
Is_Shared_Ticket¶
  • Important Notes re: Data Leakage:
    • To prevent test set leakage, it's critical to use only the training data set when calculating how frequently a ticket is shared amongst passengers.
    • Because tickets are also shared by passengers in the test data set (and likely beyond), the frequency calculation acts more as a "weight" to aid training prediction than a true measure of ticket frequency.
    • A share count map is implemented here to make the ticket share counts of the training data available to the test data.
  • Applying training set shared ticket counts to matching tickets in the test data revealed that match frequency dropped significantly for Share_Ticket_Count >= 1, motivating a binary Is_Shared_Ticket feature instead, indicating whether or not a given ticket was shared by training data passengers.
  • Distribution Shift of Is_Shared_Ticket:
    • Test data passengers with tickets matching training set shared tickets are sparse.
    • Will keep this in mind during feature experimentation and drop the feature from the model if it contributes to overfitting.
In [323]:
def create_feature_Is_Shared_Ticket(train_df, test_df):
    """
    Creates "Is_Shared_Ticket" feature, which is a binary integer indicating whether a given ticket was shared by any other training set passenger.

    Is_Shared_Ticket of tickets not shared by any other passenger is set to zero / False.
    
    Training data Is_Shared_Ticket value is associated to tickets in the test data that share the same ticket number.
    
    Is_Shared_Ticket of test data ticket numbers that do not appear in the training data is set to zero.
    
    To prevent test data set leakage, it's critical to only use the training data set when calculating the 
    count ticket is shared amongst passengers.

    Args:
        train_df (DataFrame): Training data set  
        test_df (DataFrame): Test data set 
    Returns:
        Nothing
    """

    # Subtract one so that a ticket held by only one passenger is not counted as shared
    training_ticket_counts = train_df['Ticket'].value_counts() - 1
    is_shared_ticket = training_ticket_counts > 0

    train_df['Is_Shared_Ticket'] = train_df['Ticket'].map(is_shared_ticket).astype(int)
    test_df['Is_Shared_Ticket'] = test_df['Ticket'].map(is_shared_ticket).fillna(0).astype(int)

create_feature_Is_Shared_Ticket(prepared_train_df, prepared_test_df)
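The mapping behaves as follows on toy data (illustrative only; 'A1', 'B2', etc. are made-up ticket numbers, not real Titanic tickets):

```python
import pandas as pd

toy_train = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'C3']})
toy_test = pd.DataFrame({'Ticket': ['A1', 'C3', 'Z9']})

# Only training data informs the share counts; subtract one so lone tickets count as not shared
counts = toy_train['Ticket'].value_counts() - 1   # A1 -> 1, B2 -> 0, C3 -> 0
is_shared = counts > 0

toy_train['Is_Shared_Ticket'] = toy_train['Ticket'].map(is_shared).astype(int)
# Test tickets unseen in training (Z9) default to 0
toy_test['Is_Shared_Ticket'] = toy_test['Ticket'].map(is_shared).fillna(0).astype(int)

print(toy_train['Is_Shared_Ticket'].tolist())  # [1, 1, 0, 0]
print(toy_test['Is_Shared_Ticket'].tolist())   # [1, 0, 0]
```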

Model Development¶

In [325]:
# Create functions to identify highly-correlated feature pairs to remove
def get_highly_correlated_feature_pairs(df, threshold=0.85):
    """
    Returns a DataFrame of feature pairs with correlation above the specified threshold.

    Args:
        df (pd.DataFrame): DataFrame of features (should be numeric or dummy-encoded).
        threshold (float): Minimum correlation to include in output.

    Returns:
        pd.DataFrame: Correlated feature pairs and their correlation coefficient.
    """
    corr_matrix = df.corr().abs()
    upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Filter for correlations above the threshold
    high_corr = upper_triangle.stack().reset_index()
    high_corr.columns = ['Feature A', 'Feature B', 'Correlation']
    return high_corr[high_corr['Correlation'] >= threshold].sort_values(by='Correlation', ascending=False)

def identify_lower_importance_correlated_features_to_drop(corr_df, importance_dict, threshold=0.85):
    """
    For each correlated feature pair, identifies the feature with lower importance and prints the decision.

    Args:
        corr_df (pd.DataFrame): Output from get_highly_correlated_feature_pairs().
        importance_dict (dict or pd.Series): Feature importance scores (higher is better).
        threshold (float): Correlation threshold to consider dropping features.

    Returns:
        set: Set of feature names to drop.
    """
    to_drop = set()

    print(f"Features with correlation ≥ {threshold} and lower importance:\n")

    for _, row in corr_df.iterrows():
        if row['Correlation'] >= threshold:
            feat_a = row['Feature A']
            feat_b = row['Feature B']
            imp_a = importance_dict.get(feat_a, 0)
            imp_b = importance_dict.get(feat_b, 0)

            if imp_a >= imp_b:
                to_drop.add(feat_b)
                print(f"Drop: {feat_b:30} (Importance: {imp_b:.5f})  ⬅️ Keep: {feat_a:30} (Importance: {imp_a:.5f})")
            else:
                to_drop.add(feat_a)
                print(f"Drop: {feat_a:30} (Importance: {imp_a:.5f})  ⬅️ Keep: {feat_b:30} (Importance: {imp_b:.5f})")

    return to_drop
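The upper-triangle pair-finding logic can be illustrated on toy data (a standalone sketch that mirrors get_highly_correlated_feature_pairs rather than calling it; the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # b = 2 * a, perfectly correlated
    'c': [5, 3, 4, 1, 2],   # |corr(a, c)| = 0.8, below the 0.85 threshold
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
pairs = upper.stack().reset_index()
pairs.columns = ['Feature A', 'Feature B', 'Correlation']
flagged = pairs[pairs['Correlation'] >= 0.85]
print(flagged)  # only the (a, b) pair is flagged, with correlation 1.0
```

Masking to the upper triangle avoids reporting both (a, b) and (b, a), and drops the trivial self-correlations on the diagonal.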
In [326]:
# Create reusable functions to evaluate learning curve and feature importances (where supported) for each model
def plot_validation_curve(model, param_name, param_range, selected_features=[], drop_cols=[]):

    X_all = pd.concat([prepared_train_df[selected_features], prepared_test_df[selected_features]], axis=0)
    X_all_encoded = pd.get_dummies(X_all, drop_first=False)

    # Split back using the length of the prepared training frame used in the concat above
    X_train_encoded = X_all_encoded.iloc[:len(prepared_train_df)].copy()
    X_test_encoded = X_all_encoded.iloc[len(prepared_train_df):].copy()

    def clean_encodings_for_xgb(train_encoded_df, test_encoded_df):
        for df in [train_encoded_df, test_encoded_df]:
            df.columns = df.columns.str.replace(r'[<>\[\]\(\)]', '', regex=True)
            df.columns = df.columns.str.replace(', ', '_', regex=False)
            df.columns = df.columns.str.replace(r'[^0-9a-zA-Z_]', '_', regex=True)
    clean_encodings_for_xgb(X_train_encoded, X_test_encoded)

    if drop_cols:
        X_train_encoded.drop(columns=drop_cols, inplace=True)
        X_test_encoded.drop(columns=drop_cols, inplace=True)
             
    y = prepared_train_df['Survived']
  
    print("Evaluating baseline model with the following variables:")
    for col in X_train_encoded.columns:
        print(f"* {col}")

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    train_scores, val_scores = validation_curve(
        estimator=model,  # e.g., DecisionTreeClassifier
        X=X_train_encoded,
        y=y,
        param_name=param_name,
        param_range=param_range,
        cv=cv,
        scoring='accuracy'
    )
    
    # Compute means
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    
    # Plot
    plt.plot(param_range, train_mean, label="Training Score")
    plt.plot(param_range, val_mean, label="Validation Score")
    plt.xlabel(param_name)
    plt.ylabel("Accuracy")
    plt.title("Validation Curve: max_depth")
    title = f"Validation Curve for {param_name}"
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_learning_curve(model, X, y, label=None):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    train_sizes, train_scores, val_scores = learning_curve(
        estimator=model,
        X=X,
        y=y,
        cv=cv,
        scoring='accuracy',
        n_jobs=2,
        train_sizes=np.linspace(0.1, 1.0, 10)
    )
    
    # Compute mean and std
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Plot curves
    plt.plot(train_sizes, train_mean, label='Training Score', color='blue')
    plt.plot(train_sizes, val_mean, label='Validation Score', color='orange')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2, color='orange')

    # Get final values
    final_train_acc = train_mean[-1]
    final_val_acc = val_mean[-1]
    delta = final_train_acc - final_val_acc

    # Annotate final scores and delta with 4 decimal places
    plt.text(train_sizes[-1], final_train_acc + 0.005, f"Train: {final_train_acc:.4f}", color='blue')
    plt.text(train_sizes[-1], final_val_acc - 0.035, f"Val: {final_val_acc:.4f}", color='orange')
    plt.text(train_sizes[-1] * 0.5, min(val_mean) - 0.06,
             f"Δ (Train - Val): {delta:.4f}", fontsize=10, style='italic', color='gray')

    # Labels and formatting
    plt.xlabel("Training Set Size")
    plt.ylabel("Accuracy")
    title = f"Learning Curve for {label}" if label else "Learning Curve"
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

def plot_feature_importances(model, X):
    importances = pd.Series(model.feature_importances_, index=X.columns)
    importances = importances.sort_values(ascending=False)

    # Print feature importances in a grid format
    importance_df = pd.DataFrame({
        'Feature': importances.index,
        'Importance': importances.values
    })
    print("\nEstimator Feature Importances:\n")
    print(importance_df.to_string(index=False))

    # Plot
    plt.figure(figsize=(8, 5))
    ax = importances.plot(kind='bar')
    plt.title(f"Feature Importances from {model.__class__.__name__}")
    plt.ylabel("Relative Importance")
    
    # Add value labels above each bar
    for i, value in enumerate(importances):
        plt.text(i, value + 0.001, f"{value:.3f}", ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.show()

def analyze_mistake_overrepresentation(worst_fold_mistakes, full_df):
    bool_cols = worst_fold_mistakes.select_dtypes(bool).columns
    mistake_counts = worst_fold_mistakes[bool_cols].sum()
    full_counts = full_df[bool_cols].sum()
    full_counts = full_counts.replace(0, pd.NA)
    ratio = (mistake_counts / full_counts).sort_values(ascending=False)
    comparison_df = pd.DataFrame({
        'Mistake_Count': mistake_counts,
        'Train_Count': full_counts,
        'Overrepresentation_Rate': ratio
    }).dropna().sort_values(by='Overrepresentation_Rate', ascending=False)
    return comparison_df

def plot_feature_cooccurrence_heatmap(worst_fold_mistakes, focus_col, exclude_cols=None):
    """
    Creates a heatmap showing the number of mistakes where focus_col and each other feature are both 1.
    
    Parameters:
    - worst_fold_mistakes: DataFrame containing only the mistaken predictions
    - focus_col: The binary feature to cross with all others (e.g. 'Pclass_3')
    - exclude_cols: Optional list of columns to ignore (e.g. ['Actual', 'Predicted'])
    """
    if exclude_cols is None:
        exclude_cols = ['Actual', 'Predicted']
    feature_cols = [col for col in worst_fold_mistakes.columns if col not in exclude_cols + [focus_col]]

    # Filter to rows where focus_col == 1
    relevant_mistakes = worst_fold_mistakes[worst_fold_mistakes[focus_col] == 1]

    # Count co-occurrences
    counts = {}
    for col in feature_cols:
        counts[col] = (relevant_mistakes[col] == 1).sum()
    
    # Convert to DataFrame for heatmap
    df_counts = pd.DataFrame.from_dict(counts, orient='index', columns=['Mistake_Count']).sort_values('Mistake_Count', ascending=False)

    # Plot
    plt.figure(figsize=(8, len(df_counts) * 0.4 + 1))
    sns.heatmap(df_counts.T, annot=True, cmap='Reds', cbar=False, fmt='d')
    plt.title(f"Mistake Counts When {focus_col} == 1")
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
In [327]:
# Create reusable methods to accelerate feature experimentation

def iterate_model(selected_features=[], drop_cols=[], feature_importances=False, permutation_importances=False, learning_curve=False, analyze_mistakes=True, **model_params):

    X_train_selected = prepared_train_df[selected_features].copy() 
    X_test_selected = prepared_test_df[selected_features].copy()
    y_train_full = prepared_train_df['Survived']

    if drop_cols:
        X_train_selected.drop(columns=drop_cols, inplace=True)
        X_test_selected.drop(columns=drop_cols, inplace=True)
        print(f"\nDropping these variables from model input:")
        for col in drop_cols:
            print(f"* {col}")

    clf = XGBClassifier(**model_params)

    print(f"\nEvaluating {clf.__class__.__name__} model with the following variables:")
    for col in X_train_selected.columns:
        print(f"* {col}")
        
    cv_scores = []
    fold_indices = []
    all_preds = []
    oof_preds = np.zeros_like(y_train_full)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X_train_selected, y_train_full):
        X_train, X_val = X_train_selected.iloc[train_idx], X_train_selected.iloc[test_idx]
        y_train, y_val = y_train_full.iloc[train_idx], y_train_full.iloc[test_idx]

        clf.fit(X_train, y_train)
        preds = clf.predict(X_val)

        cv_scores.append(accuracy_score(y_val, preds))
        fold_indices.append((train_idx, test_idx))
        all_preds.append(preds)
        oof_preds[test_idx] = preds
                         
    worst_fold_idx = np.argmin(cv_scores)
    worst_test_idx = fold_indices[worst_fold_idx][1]

    X_worst = X_train_selected.iloc[worst_test_idx].copy()
    y_worst = y_train_full.iloc[worst_test_idx].copy()
    preds_worst = all_preds[worst_fold_idx]
    
    mistakes = X_worst[y_worst != preds_worst].copy()
    mistakes['Actual'] = y_worst[y_worst != preds_worst]
    mistakes['Predicted'] = preds_worst[y_worst != preds_worst]

    print(f"\nCV Scores: {cv_scores}")
    print(f"Worst Fold Index: {worst_fold_idx}")
    print(f"Mean Accuracy: {np.mean(cv_scores):.4f}")
    print(f"Standard Deviation: {np.std(cv_scores):.4f}")

    # Retrain classifier on the full training dataset
    clf.fit(X_train_selected, y_train_full)

    if learning_curve:
        plot_learning_curve(clf, X_train_selected, y_train_full, label=f"{len(X_train_selected.columns)} columns") 
    
    if feature_importances:   
        plot_feature_importances(clf, X_train_selected)

    if permutation_importances:
        result = permutation_importance(clf, X_train_selected, y_train_full, n_repeats=10, random_state=42, scoring='accuracy')
        importances = pd.DataFrame({
            'feature': X_train_selected.columns,
            'importance_mean': result.importances_mean,
            'importance_std': result.importances_std
        }).sort_values(by='importance_mean', ascending=False)
        print("\nPermutation Importances:")
        print(importances)
        print()

    corr_df = get_highly_correlated_feature_pairs(X_train_selected, threshold=0.85)
    importance_dict = dict(zip(clf.feature_names_in_, clf.feature_importances_))
    features_to_drop = identify_lower_importance_correlated_features_to_drop(corr_df, importance_dict)
    print(f"Found {len(features_to_drop)} feature(s) to drop.\n\n")

    # Return (fitted model, filtered test_df, mistakes, filtered train_df) tuple for submission pipeline
    return clf, X_test_selected, oof_preds, X_train_selected

Baseline Establishment¶

Baseline accuracy scores are established to gauge the progress and efficacy of the models we develop with our engineered features.

Predict Majority Class¶

The first baseline simply calculates the accuracy of a simulated prediction that assigns every passenger to the majority target class.

  • Baseline Accuracy of Predicting Majority Class: 62%
In [332]:
# Create a majority-class series and score it against the target values in the training data
target = prepared_train_df['Survived']
majority_class = target.mode()[0]
baseline_majority = pd.Series([majority_class] * len(target))
baseline_accuracy = accuracy_score(target, baseline_majority)
print(f"Baseline majority-class accuracy: {baseline_accuracy:.2f}")
Baseline majority-class accuracy: 0.62
Predict Simple Model¶
  • This baseline trains an XGBClassifier using only the one-hot encoded Pclass_Sex features to evaluate their standalone predictive power.
  • A 5-fold stratified cross-validation setup is used to train and evaluate the model. Accuracy is calculated as the mean across folds, with each fold serving once as "unseen" validation data.
  • In addition to model accuracy, we capture diagnostics to guide future iterations:
    • Learning Curve: Plots training-set and validation-set accuracy as the training set grows, giving a quick read on whether the model is overfitting or underfitting.
    • Feature Importances: Reports which features the model relied on most when making decisions.
    • Permutation Importances: Measures how much each feature affects model accuracy by randomly shuffling its values and observing the score change.
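The shuffling idea behind permutation importance can be illustrated with a minimal manual implementation. This is a sketch on synthetic data with a RandomForest stand-in, not scikit-learn's `permutation_importance` internals or the notebook's actual model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Toy data: with shuffle=False the informative columns come first
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X, y)
base_acc = accuracy_score(y, clf.predict(X))

# Permutation importance by hand: shuffle one column at a time and
# measure how much accuracy drops relative to the unshuffled baseline.
rng = np.random.default_rng(42)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(base_acc - accuracy_score(y, clf.predict(X_perm)))

print([round(v, 3) for v in importances])
```

A feature whose shuffling barely moves accuracy contributes little true signal, even if the model splits on it often — which is exactly the discrepancy observed below.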

Summary Observations:

  • Average Accuracy of XGBClassifier with Pclass_Sex OHE Features Only: 77.4%
  • Learning Curve: Suggests the model is slightly underfitting, as training and validation accuracy converge closely (with training accuracy decreasing) as the training set size increases
  • Feature Importances:
    • Feature contributions are relatively balanced, in that no feature dominates the model's decision making at the expense of others.
    • Most used feature is Pclass_Sex_3_female, used in 35.4% of model splits
    • Least used feature is Pclass_Sex_1_female, used in 5.7% of model splits
  • Permutation Importances:
    • Surprisingly, the Pclass_Sex_3_female feature -- the most used feature -- has the lowest Permutation Importance with a mean of 0.07%
    • This suggests the model may be overusing this feature despite its limited true value, possibly due to group imbalances or confounded patterns
In [336]:
baseline_input_features = pclass_sex_oh_cols
baseline_drop_cols = []
baseline_model, baseline_X_test_encoded, worst_fold_mistakes, X_train_encoded = iterate_model(
    baseline_input_features, 
    baseline_drop_cols, 
    feature_importances=True,
    permutation_importances=True,
    learning_curve=True
)
Evaluating XGBClassifier model with the following variables:
* Pclass_Sex_1_female
* Pclass_Sex_1_male
* Pclass_Sex_2_female
* Pclass_Sex_2_male
* Pclass_Sex_3_female
* Pclass_Sex_3_male

CV Scores: [0.7821229050279329, 0.7640449438202247, 0.7696629213483146, 0.7640449438202247, 0.7921348314606742]
Worst Fold Index: 1
Mean Accuracy: 0.7744
Standard Deviation: 0.0111
No description has been provided for this image
Estimator Feature Importances:

            Feature  Importance
Pclass_Sex_3_female    0.353995
  Pclass_Sex_3_male    0.291663
  Pclass_Sex_2_male    0.143335
Pclass_Sex_1_female    0.081116
Pclass_Sex_2_female    0.072465
  Pclass_Sex_1_male    0.057427
No description has been provided for this image
Permutation Importances:
               feature  importance_mean  importance_std
5    Pclass_Sex_3_male         0.168911        0.010942
3    Pclass_Sex_2_male         0.073850        0.004071
0  Pclass_Sex_1_female         0.046352        0.009629
2  Pclass_Sex_2_female         0.034007        0.010014
1    Pclass_Sex_1_male         0.031987        0.003339
4  Pclass_Sex_3_female         0.000673        0.007137

Features with correlation ≥ 0.85 and lower importance:

Found 0 feature(s) to drop.


Engineered Features Test¶

In [539]:
selected_feature_list = pclass_sex_oh_cols.tolist() + global_feature_cols + smoothed_feature_cols
selected_drop_cols = [
    # Baseline Accuracy before dropping: 0.8013 

    # Ablation Testing Round 1
    'global_Pclass_Title_normalized_smoothed',   # 0.8013 -> 0.8070
     #'global_Title_normalized_smoothed',        # 0.8070 -> 0.7879 (!)
     'Pclass_Sex_1_female',                      # 0.8070 -> 0.8070
     'Pclass_Sex_1_male',                        # 0.8070 -> 0.8070
     'Pclass_Sex_2_female',                      # 0.8070 -> 0.8070
     'Pclass_Sex_2_male',                        # 0.8070 -> 0.8070
     'Pclass_Sex_3_female',                      # 0.8070 -> 0.8070
     'Pclass_Sex_3_male',                        # 0.8070 -> 0.8070
     'P1_Female_overall_smoothed',               # 0.8070 -> 0.8103
     'P2_Male_overall_smoothed',                 # 0.8103 -> 0.8126
      #'P3_Female_overall_smoothed'              # 0.8126 -> 0.8070 (!)
      'P1_Male_overall_smoothed',                # 0.8126 -> 0.8137
      'P2_Male_Deck_bin_smoothed',               # 0.8137 -> 0.8137
      'P2_Male_Cabin_Location_s_smoothed',       # 0.8137 -> 0.8137
      #'global_Pclass_Cabin_Location_s_smoothed' # 0.8137 -> 0.8092 (!)
      #'global_Pclass_HasCabin_smoothed',        # 0.8137 -> 0.8126 (!)   
      #'P3_Male_Deck_bin_smoothed'               # 0.8137 -> 0.8047 (!)
     'P3_Female_overall_smoothed',               # 0.8137 -> 0.8148
     'P3_Male_overall_smoothed',                 # 0.8148 -> 0.8148
      #'global_Pclass_HasCabin_smoothed'         # 0.8148 -> 0.8104 (!)
      #'global_Pclass_Deck_bin_smoothed'         # 0.8148 -> 0.8059 (!)
      #'global_Sex_Parch_SibSp_bin_smoothed'     # 0.8148 -> 0.8126 (!)
      #'global_Sex_Embarked_smoothed'            # 0.8148 -> 0.8070 (!)
      #'P3_Female_Embarked_smoothed'             # 0.8148 -> 0.8103 (!)
      #'P3_Female_Title_normalized_smoothed'     # 0.8148 ->  0.8137 (!)
      #'P3_Female_Parch_SibSp_bin_smoothed'      # 0.8148 -> 0.8137
      #'P3_Male_Age_Group_smoothed'              # 0.8148 -> 0.8092
     'P3_Female_Age_Group_smoothed',             # 0.8148 -> 0.8171
     'P2_Male_HasCabin_smoothed',                # 0.8171 -> 0.8182
     'global_Pclass_Sex_smoothed',               # 0.8182 -> 0.8182
     'P2_Male_Age_Group_smoothed',               # Acc: 0.8182 -> 0.8159  Std: 0.0182 -> 0.0088
     'P3_Male_Deck_bin_smoothed',                # 0.8159 -> 0.8204
     #'global_Sex_HasCabin_smoothed'             # 0.8204 -> 0.8137 (!)
     #'P3_Male_Parch_SibSp_bin_smoothed'         # 0.8204 -> 0.8182 (!)
     #'P1_Female_Parch_SibSp_bin_smoothed'       # 0.8204 -> 0.8193 (!)
     #'P1_Female_Deck_bin_smoothed',             # 0.8204 -> 0.8182 (!)
     #'P1_Female_Age_Group_smoothed'             # 0.8204 -> 0.8103 (!)
     'P3_Male_Deck_bin_smoothed',                # 0.8204 -> 0.8204
     'P3_Male_Title_normalized_smoothed',        # 0.8204 -> 0.8216
     #'global_Deck_bin_smoothed',                # 0.8216 -> 0.8171
     #'global_Pclass_Embarked_smoothed'          # 0.8216 ->  0.8036
     #'global_Embarked_HasCabin_smoothed'        # 0.8216 -> 0.8159
     #'P2_Female_overall_smoothed'               # 0.8216 -> 0.8193

    # Ablation Testing Round 2
    'P1_Female_Parch_SibSp_bin_smoothed',
    'P1_Female_Deck_bin_smoothed',
    # * P1_Male_Age_Group_smoothed
    'P2_Male_Title_normalized_smoothed',
    'P2_Male_Parch_SibSp_bin_smoothed',
    'P3_Female_Parch_SibSp_bin_smoothed',
    # * P3_Female_Embarked_smoothed
     'P3_Female_Title_normalized_smoothed',
    # * P3_Male_Age_Group_smoothed
    'P3_Male_Parch_SibSp_bin_smoothed',
    'global_HasCabin_Parch_SibSp_bin_smoothed',
    'global_Pclass_Cabin_Location_s_smoothed',
    'global_Sex_HasCabin_smoothed',
    'global_Title_normalized_smoothed',
    'global_Sex_Embarked_smoothed',
    'global_Sex_Parch_SibSp_bin_smoothed',
    'global_Pclass_Parch_SibSp_bin_smoothed',
     #'global_Pclass_Deck_bin_smoothed',
    'global_Pclass_Embarked_smoothed',
    'global_Pclass_HasCabin_smoothed',
    'global_Embarked_HasCabin_smoothed',
    'global_Deck_bin_smoothed',
     #'global_Parch_SibSp_bin_smoothed',
     #'P2_Female_overall_smoothed',
     #'P1_Female_Age_Group_smoothed',
    'P1_Male_Age_Group_smoothed',
    #'P3_Female_Embarked_smoothed',
    #'P3_Male_Age_Group_smoothed' 
]  
submission_model, submission_X_test_selected, oof_preds, X_train_selected = iterate_model(selected_feature_list, selected_drop_cols, 
    feature_importances=True, permutation_importances=True, learning_curve=True,
    max_depth=3,
    min_child_weight=1,
    gamma=1,
    subsample=0.6,
    colsample_bytree=1,                                                                         
    learning_rate=0.01,
    n_estimators=250,
    reg_alpha=0,
    reg_lambda=1,
    eval_metric='error'
)
Dropping these variables from model input:
* global_Pclass_Title_normalized_smoothed
* Pclass_Sex_1_female
* Pclass_Sex_1_male
* Pclass_Sex_2_female
* Pclass_Sex_2_male
* Pclass_Sex_3_female
* Pclass_Sex_3_male
* P1_Female_overall_smoothed
* P2_Male_overall_smoothed
* P1_Male_overall_smoothed
* P2_Male_Deck_bin_smoothed
* P2_Male_Cabin_Location_s_smoothed
* P3_Female_overall_smoothed
* P3_Male_overall_smoothed
* P3_Female_Age_Group_smoothed
* P2_Male_HasCabin_smoothed
* global_Pclass_Sex_smoothed
* P2_Male_Age_Group_smoothed
* P3_Male_Deck_bin_smoothed
* P3_Male_Deck_bin_smoothed
* P3_Male_Title_normalized_smoothed
* P1_Female_Parch_SibSp_bin_smoothed
* P1_Female_Deck_bin_smoothed
* P2_Male_Title_normalized_smoothed
* P2_Male_Parch_SibSp_bin_smoothed
* P3_Female_Parch_SibSp_bin_smoothed
* P3_Female_Title_normalized_smoothed
* P3_Male_Parch_SibSp_bin_smoothed
* global_HasCabin_Parch_SibSp_bin_smoothed
* global_Pclass_Cabin_Location_s_smoothed
* global_Sex_HasCabin_smoothed
* global_Title_normalized_smoothed
* global_Sex_Embarked_smoothed
* global_Sex_Parch_SibSp_bin_smoothed
* global_Pclass_Parch_SibSp_bin_smoothed
* global_Pclass_Embarked_smoothed
* global_Pclass_HasCabin_smoothed
* global_Embarked_HasCabin_smoothed
* global_Deck_bin_smoothed
* P1_Male_Age_Group_smoothed

Evaluating XGBClassifier model with the following variables:
* global_Pclass_Deck_bin_smoothed
* global_Parch_SibSp_bin_smoothed
* P2_Female_overall_smoothed
* P1_Female_Age_Group_smoothed
* P3_Female_Embarked_smoothed
* P3_Male_Age_Group_smoothed

CV Scores: [0.8156424581005587, 0.8202247191011236, 0.7808988764044944, 0.8202247191011236, 0.8202247191011236]
Worst Fold Index: 2
Mean Accuracy: 0.8114
Standard Deviation: 0.0154
No description has been provided for this image
Estimator Feature Importances:

                        Feature  Importance
     P3_Male_Age_Group_smoothed    0.285831
     P2_Female_overall_smoothed    0.211786
   P1_Female_Age_Group_smoothed    0.199908
    P3_Female_Embarked_smoothed    0.128072
global_Pclass_Deck_bin_smoothed    0.093019
global_Parch_SibSp_bin_smoothed    0.081384
No description has been provided for this image
Permutation Importances:
                           feature  importance_mean  importance_std
2       P2_Female_overall_smoothed         0.079686        0.006696
3     P1_Female_Age_Group_smoothed         0.069809        0.006540
5       P3_Male_Age_Group_smoothed         0.045230        0.007360
4      P3_Female_Embarked_smoothed         0.030415        0.004747
0  global_Pclass_Deck_bin_smoothed         0.010550        0.003521
1  global_Parch_SibSp_bin_smoothed         0.005612        0.003367

Features with correlation ≥ 0.85 and lower importance:

Found 0 feature(s) to drop.


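The ablation bookkeeping above (drop one candidate at a time and keep the drop whenever mean CV accuracy does not decrease) could be automated. A minimal sketch of that greedy loop, using toy data and a `cv_accuracy` stand-in rather than the notebook's `iterate_model`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in data; cv_accuracy plays the role of iterate_model's
# mean cross-validation accuracy.
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                               random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

def cv_accuracy(features):
    """Mean 5-fold stratified CV accuracy on the given feature subset."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[features], y, cv=skf).mean()

kept = list(X.columns)
best = cv_accuracy(kept)
dropped = []
# Greedy backward ablation: try removing each feature in turn and keep
# the removal whenever mean CV accuracy does not get worse.
for col in list(kept):
    if len(kept) == 1:
        break
    trial = [c for c in kept if c != col]
    score = cv_accuracy(trial)
    if score >= best:
        kept, best = trial, score
        dropped.append(col)

print(f"Dropped {dropped}, final CV accuracy {best:.4f}")
```

Like the manual rounds above, this single greedy pass is order-dependent, so repeating it (Round 2) after the feature set shrinks can surface further drops.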
Hyperparameter Tuning¶

In [340]:
plot_validation_curve(submission_model, 'reg_lambda', [0, 1, 5, 10], selected_feature_list, selected_drop_cols)
Evaluating baseline model with the following variables:
* global_Pclass_Deck_bin_smoothed
* global_Parch_SibSp_bin_smoothed
* P2_Female_overall_smoothed
* P1_Female_Age_Group_smoothed
* P3_Female_Embarked_smoothed
* P3_Male_Age_Group_smoothed
No description has been provided for this image
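The `plot_validation_curve` helper used above is defined earlier in the notebook; the same kind of one-parameter sweep can be reproduced with scikit-learn's `validation_curve`. This sketch uses a GradientBoostingClassifier on synthetic data and sweeps `max_depth`, as a stand-in for the notebook's XGBClassifier `reg_lambda` sweep:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, validation_curve

X, y = make_classification(n_samples=300, random_state=42)

# validation_curve refits the model on every CV fold for each value of
# the swept hyperparameter, returning per-fold train and val scores.
param_range = [1, 2, 3, 4]
train_scores, val_scores = validation_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=param_range,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy")

for d, tr, va in zip(param_range, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}  val={va:.3f}")
```

A widening gap between the train and validation curves as the parameter loosens regularization is the usual signal to stop increasing it.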

Out-of-Fold Prediction Mistake Analysis¶

In [342]:
def compute_oof_subgroup_mistakes(oof_preds, y_true, group_col, train_df):
    """
    Computes number of mistakes per group based on out-of-fold predictions.

    Args:
        oof_preds (np.ndarray or pd.Series): Out-of-fold predicted labels (0/1).
        y_true (np.ndarray or pd.Series): Ground truth labels.
        group_col (str): Column name in train_df to group by (e.g., 'Pclass_Sex').
        train_df (pd.DataFrame): DataFrame that includes the group_col.

    Returns:
        pd.DataFrame: Mistake counts and mistake rates by group.
    """
    df = train_df.copy()
    df['y_true'] = y_true
    df['y_pred'] = oof_preds
    df['mistake'] = df['y_true'] != df['y_pred']

    result = (
        df.groupby(group_col)
          .agg(
              Mistake_Count=('mistake', 'sum'),
              Train_Count=(group_col, 'count')
          )
          .assign(Mistake_Rate=lambda d: d['Mistake_Count'] / d['Train_Count'])
          .sort_values('Mistake_Count', ascending=False)
    )

    return result
  • The following table counts the number of mistakes generated for each Pclass_Sex group found in the training set.
  • The most mistakes were made within the 3_female group, which is the 2nd-largest group represented in the training set.
In [344]:
prepared_train_df['Pclass_Sex'] = (
    prepared_train_df['Pclass'].astype(str) + '_' + prepared_train_df['Sex'].astype(str)
)
oof_mistakes_df = compute_oof_subgroup_mistakes(
    oof_preds=oof_preds, 
    y_true=prepared_train_df['Survived'], 
    group_col="Pclass_Sex", 
    train_df=prepared_train_df
)
display(oof_mistakes_df)
Pclass_Sex  Mistake_Count  Train_Count  Mistake_Rate
3_female               48          144      0.333333
3_male                 47          347      0.135447
1_male                 44          122      0.360656
2_male                 16          108      0.148148
1_female                7           94      0.074468
2_female                6           76      0.078947

SHAP Analysis of Mistakes¶

  • We use SHAP analysis to understand the magnitude and direction of each feature's contribution to the model's mistaken predictions of 3_female survival.
  • The analysis revealed that the model relied on features that were not intended to affect 3_female survival.
  • The model used the following features to predict the survival of a 3_female passenger, all nominally irrelevant to this subgroup:
    • P3_Male_Age_Group_smoothed
    • P2_Female_overall_smoothed
    • P1_Female_Age_Group_smoothed
In [347]:
# Zoom in on 3_female mistakes
mask_3_female = (prepared_train_df['Pclass'] == 3) & (prepared_train_df['Sex'] == 'female')
X_3_female = X_train_selected[mask_3_female]
mistake_mask = (oof_preds != prepared_train_df['Survived']) & mask_3_female
X_3_female_mistakes = X_train_selected[mistake_mask]
explainer = shap.Explainer(submission_model)
X_shap_safe = X_3_female_mistakes.copy()
X_shap_safe.fillna(0.0, inplace=True)  # fill NaNs with a neutral value before computing SHAP values
shap_values = explainer(X_shap_safe)
In [348]:
shap.plots.beeswarm(shap_values, show=False)
plt.title("3_female SHAP Values")
plt.show()
No description has been provided for this image
In [349]:
shap.plots.waterfall(shap_values[0], show=False)
plt.title("3_female SHAP Mistake Example")
plt.show()
No description has been provided for this image
In [350]:
shap.plots.waterfall(shap_values[1], show=False)
plt.title("3_female SHAP Mistake Example")
plt.show()
No description has been provided for this image
In [351]:
shap.plots.waterfall(shap_values[2], show=False)
plt.title("3_female SHAP Mistake Example")
plt.show()
No description has been provided for this image
In [352]:
shap.plots.waterfall(shap_values[3], show=False)
plt.title("3_female SHAP Mistake Example")
plt.show()
No description has been provided for this image
In [353]:
shap.plots.waterfall(shap_values[4], show=False)
plt.title("3_female SHAP Mistake Example")
plt.show()
No description has been provided for this image

Submission¶

In [356]:
# Confirm submission input features
selected_feature_list
Out[356]:
['Pclass_Sex_1_female',
 'Pclass_Sex_1_male',
 'Pclass_Sex_2_female',
 'Pclass_Sex_2_male',
 'Pclass_Sex_3_female',
 'Pclass_Sex_3_male',
 'global_Pclass_Title_normalized_smoothed',
 'global_Pclass_Sex_smoothed',
 'global_Sex_HasCabin_smoothed',
 'global_Title_normalized_smoothed',
 'global_Sex_Embarked_smoothed',
 'global_Sex_Parch_SibSp_bin_smoothed',
 'global_Pclass_Parch_SibSp_bin_smoothed',
 'global_Pclass_Deck_bin_smoothed',
 'global_HasCabin_Parch_SibSp_bin_smoothed',
 'global_Pclass_Embarked_smoothed',
 'global_Pclass_Cabin_Location_s_smoothed',
 'global_Pclass_HasCabin_smoothed',
 'global_Embarked_HasCabin_smoothed',
 'global_Deck_bin_smoothed',
 'global_Parch_SibSp_bin_smoothed',
 'P1_Male_overall_smoothed',
 'P2_Male_overall_smoothed',
 'P3_Male_overall_smoothed',
 'P1_Female_overall_smoothed',
 'P2_Female_overall_smoothed',
 'P3_Female_overall_smoothed',
 'P1_Female_Age_Group_smoothed',
 'P1_Female_Parch_SibSp_bin_smoothed',
 'P1_Female_Deck_bin_smoothed',
 'P1_Male_Age_Group_smoothed',
 'P2_Male_Age_Group_smoothed',
 'P2_Male_Title_normalized_smoothed',
 'P2_Male_Cabin_Location_s_smoothed',
 'P2_Male_Parch_SibSp_bin_smoothed',
 'P2_Male_Deck_bin_smoothed',
 'P2_Male_HasCabin_smoothed',
 'P3_Female_Parch_SibSp_bin_smoothed',
 'P3_Female_Embarked_smoothed',
 'P3_Female_Age_Group_smoothed',
 'P3_Female_Title_normalized_smoothed',
 'P3_Male_Age_Group_smoothed',
 'P3_Male_Title_normalized_smoothed',
 'P3_Male_Deck_bin_smoothed',
 'P3_Male_Parch_SibSp_bin_smoothed']
In [357]:
selected_drop_cols
Out[357]:
['global_Pclass_Title_normalized_smoothed',
 'Pclass_Sex_1_female',
 'Pclass_Sex_1_male',
 'Pclass_Sex_2_female',
 'Pclass_Sex_2_male',
 'Pclass_Sex_3_female',
 'Pclass_Sex_3_male',
 'P1_Female_overall_smoothed',
 'P2_Male_overall_smoothed',
 'P1_Male_overall_smoothed',
 'P2_Male_Deck_bin_smoothed',
 'P2_Male_Cabin_Location_s_smoothed',
 'P3_Female_overall_smoothed',
 'P3_Male_overall_smoothed',
 'P3_Female_Age_Group_smoothed',
 'P2_Male_HasCabin_smoothed',
 'global_Pclass_Sex_smoothed',
 'P2_Male_Age_Group_smoothed',
 'P3_Male_Deck_bin_smoothed',
 'P3_Male_Deck_bin_smoothed',
 'P3_Male_Title_normalized_smoothed',
 'P1_Female_Parch_SibSp_bin_smoothed',
 'P1_Female_Deck_bin_smoothed',
 'P2_Male_Title_normalized_smoothed',
 'P2_Male_Parch_SibSp_bin_smoothed',
 'P3_Female_Parch_SibSp_bin_smoothed',
 'P3_Female_Title_normalized_smoothed',
 'P3_Male_Parch_SibSp_bin_smoothed',
 'global_HasCabin_Parch_SibSp_bin_smoothed',
 'global_Pclass_Cabin_Location_s_smoothed',
 'global_Sex_HasCabin_smoothed',
 'global_Title_normalized_smoothed',
 'global_Sex_Embarked_smoothed',
 'global_Sex_Parch_SibSp_bin_smoothed',
 'global_Pclass_Parch_SibSp_bin_smoothed',
 'global_Pclass_Embarked_smoothed',
 'global_Pclass_HasCabin_smoothed',
 'global_Embarked_HasCabin_smoothed',
 'global_Deck_bin_smoothed',
 'P1_Male_Age_Group_smoothed']
In [358]:
submission_X_test_selected.isnull().sum().loc[lambda x: x > 0]
Out[358]:
Series([], dtype: int64)
In [359]:
# Confirm no unexpected columns
submission_X_test_selected.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   global_Pclass_Deck_bin_smoothed  418 non-null    float64
 1   global_Parch_SibSp_bin_smoothed  418 non-null    float64
 2   P2_Female_overall_smoothed       418 non-null    float64
 3   P1_Female_Age_Group_smoothed     418 non-null    float64
 4   P3_Female_Embarked_smoothed      418 non-null    float64
 5   P3_Male_Age_Group_smoothed       418 non-null    float64
dtypes: float64(6)
memory usage: 19.7 KB
In [360]:
# Confirm submission model configuration
submission_model
Out[360]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None, colsample_bytree=1,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric='error', feature_types=None, feature_weights=None,
              gamma=1, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=250, n_jobs=None,
              num_parallel_tree=None, ...)
In [546]:
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': submission_model.predict(submission_X_test_selected)
})
timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
filename = f'./submission-{timestamp}.csv'
submission.to_csv(filename, index=False)

References¶

  • (1) "Titanic Deckplans." Encyclopedia Titanica, https://www.encyclopedia-titanica.org/titanic-deckplans/

Table of Contents Generator¶

In [366]:
import json
import re

def slugify(text):
    text = text.strip()
    text = re.sub(r'[^\w\s\-]', '', text)  # keep word characters, whitespace, and hyphens
    return re.sub(r'[\s]+', '-', text)

def extract_headings(ipynb_path):
    with open(ipynb_path, 'r', encoding='utf-8') as f:
        nb = json.load(f)

    toc_lines = ["## Table of Contents\n"]
    for cell in nb['cells']:
        if cell['cell_type'] == 'markdown':
            for line in cell['source']:
                match = re.match(r'^(#{2,6})\s+(.*)', line)
                if match:
                    level = len(match.group(1)) - 1  # offset for nesting
                    title = match.group(2).strip()
                    anchor = slugify(title)
                    indent = '    ' * (level - 1)
                    toc_lines.append(f"{indent}1. [{title}](#{anchor})")

    return '\n'.join(toc_lines)

# Example usage:
toc = extract_headings("titanic_ml_from_disaster__paultongyoo.ipynb")
print(toc)
## Table of Contents

1. [Table of Contents](#Table-of-Contents)
1. [Project Summary](#Project-Summary)
    1. [What I Did](#What-I-Did)
    1. [What I Learned](#What-I-Learned)
    1. [What's Next](#Whats-Next)
1. [Introduction](#Introduction)
1. [Methodology](#Methodology)
    1. [Data Understanding](#Data-Understanding)
        1. [Data Dictionary](#Data-Dictionary)
        1. [Variable Notes](#Variable-Notes)
        1. [Descriptive Statistics](#Descriptive-Statistics)
        1. [Row Samples](#Row-Samples)
        1. [Data Types](#Data-Types)
        1. [Missing Values Summary](#Missing-Values-Summary)
    1. [Data Preparation](#Data-Preparation)
        1. [Missing Value Imputation](#Missing-Value-Imputation)
            1. [Embarked](#Embarked)
            1. [Cabin](#Cabin)
            1. [Age](#Age)
            1. [Fare](#Fare)
    1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
        1. [Target](#Target)
        1. [Individual Features x Target](#Individual-Features-x-Target)
            1. [Pclass](#Pclass)
            1. [Sex](#Sex)
            1. [SibSp](#SibSp)
            1. [Parch](#Parch)
            1. [Embarked_](#Embarked_)
            1. [HasCabin](#HasCabin)
            1. [Cabin_count](#Cabin_count)
            1. [Cabin_Location_s](#Cabin_Location_s)
            1. [Deck](#Deck)
            1. [Title](#Title)
            1. [Age_](#Age_)
            1. [Age_Group](#Age_Group)
            1. [Fare_](#Fare_)
            1. [Summary of Single Feature Relationship with Target](#Summary-of-Single-Feature-Relationship-with-Target)
        1. [Composite Feature x Target](#Composite-Feature-x-Target)
            1. [Pclass x Sex](#Pclass-x-Sex)
            1. [Pclass x Title](#Pclass-x-Title)
            1. [Pclass x Parch](#Pclass-x-Parch)
            1. [Pclass x SibSp](#Pclass-x-SibSp)
            1. [Sex x Parch](#Sex-x-Parch)
            1. [Sex x SibSp](#Sex-x-SibSp)
            1. [Pclass x Embarked](#Pclass-x-Embarked)
            1. [Sex x Embarked](#Sex-x-Embarked)
            1. [Pclass x HasCabin](#Pclass-x-HasCabin)
            1. [Sex x HasCabin](#Sex-x-HasCabin)
            1. [Parch x HasCabin](#Parch-x-HasCabin)
            1. [SibSp x HasCabin](#SibSp-x-HasCabin)
            1. [Embarked x HasCabin](#Embarked-x-HasCabin)
            1. [Pclass x Cabin_count](#Pclass-x-Cabin_count)
            1. [Sex x Cabin_count](#Sex-x-Cabin_count)
            1. [Pclass x Cabin_Location_s](#Pclass-x-Cabin_Location_s)
            1. [Sex x Cabin_Location_s](#Sex-x-Cabin_Location_s)
            1. [Pclass x Deck_bin](#Pclass-x-Deck_bin)
            1. [Sex x Deck_bin](#Sex-x-Deck_bin)
            1. [Parch x Deck_bin](#Parch-x-Deck_bin)
            1. [SibSp x Deck_bin](#SibSp-x-Deck_bin)
            1. [Deck x Cabin_Location_s](#Deck-x-Cabin_Location_s)
            1. [Pclass x Title_bin](#Pclass-x-Title_bin)
            1. [Sex x Title_bin](#Sex-x-Title_bin)
            1. [Pclass x Age_Group](#Pclass-x-Age_Group)
            1. [Sex x Age_Group](#Sex-x-Age_Group)
            1. [Pclass x FPP_log_bin](#Pclass-x-FPP_log_bin)
            1. [Sex x FPP_log_bin](#Sex-x-FPP_log_bin)
            1. [Pclass x Parch_SibSp](#Pclass-x-Parch_SibSp)
            1. [Sex x Parch_SibSp](#Sex-x-Parch_SibSp)
            1. [HasCabin x Parch_SibSp](#HasCabin-x-Parch_SibSp)
        1. [Hi-Cardinality Features](#Hi-Cardinality-Features)
            1. [Ticket](#Ticket)
        1. [Feature Priority Based on EDA](#Feature-Priority-Based-on-EDA)
    1. [Cross-Fold Distribution Shift Analysis](#Cross-Fold-Distribution-Shift-Analysis)
    1. [Feature Engineering](#Feature-Engineering)
        1. [Reduce Distribution Shift of Select Features](#Reduce-Distribution-Shift-of-Select-Features)
            1. [Pclass x Age_Group](#Pclass-x-Age_Group)
            1. [Pclass_HasCabin](#Pclass_HasCabin)
            1. [Sex x HasCabin](#Sex-x-HasCabin)
            1. [Embarked x HasCabin](#Embarked-x-HasCabin)
            1. [Parch_SibSp_bin](#Parch_SibSp_bin)
            1. [HasCabin x Parch_SibSp_bin](#HasCabin-x-Parch_SibSp_bin)
            1. [Pclass x Parch_SibSp_bin](#Pclass-x-Parch_SibSp_bin)
            1. [Sex x Parch_SibSp_bin](#Sex-x-Parch_SibSp_bin)
            1. [Pclass x Embarked](#Pclass-x-Embarked)
            1. [Sex x Embarked](#Sex-x-Embarked)
            1. [Pclass x Deck_bin](#Pclass-x-Deck_bin)
            1. [Pclass x Cabin_Location_s](#Pclass-x-Cabin_Location_s)
            1. [Pclass x Normalized Title](#Pclass-x-Normalized-Title)
            1. [Deck_bin](#Deck_bin)
            1. [Title_normalized](#Title_normalized)
        1. [Pclass_Sex One-Hot Encodings](#Pclass_Sex-One-Hot-Encodings)
        1. [Survival Association Tests](#Survival-Association-Tests)
            1. [Global Feature Survival Association Tests](#Global-Feature-Survival-Association-Tests)
            1. [Pclass x Sex Subgroup Feature Survival Association Tests](#Pclass-x-Sex-Subgroup-Feature-Survival-Association-Tests)
            1. [Survival Association Test Strategy and Results](#Survival-Association-Test-Strategy-and-Results)
        1. [Smoothed Survival Rate Feature Engineering](#Smoothed-Survival-Rate-Feature-Engineering)
            1. [Generate Global Smoothed Features](#Generate-Global-Smoothed-Features)
            1. [Is_Shared_Ticket](#Is_Shared_Ticket)
    1. [Model Development](#Model-Development)
        1. [Baseline Establishment](#Baseline-Establishment)
            1. [Predict Majority Class](#Predict-Majority-Class)
            1. [Predict Simple Model](#Predict-Simple-Model)
        1. [Engineered Features Test](#Engineered-Features-Test)
    1. [Hyperparameter Tuning](#Hyperparameter-Tuning)
        1. [Out-of-Fold Prediction Mistake Analysis](#Out-of-Fold-Prediction-Mistake-Analysis)
        1. [SHAP Analysis of Mistakes](#SHAP-Analysis-of-Mistakes)
1. [Submission](#Submission)
1. [References](#References)
        1. [Table of Contents Generator](#Table-of-Contents-Generator)